HalfHumanDraft

The C Revelation

Sorting · Part 2 of 7

What happened when Python overhead disappeared and the real algorithm showed up. (SafeSlot-C: #4 of 12, MicroSSS: #11 of 12, CAS-Core: #10 of 12 in the unified benchmark)

Series: 1 · Disorder Clusters | 2 · The C Revelation | 3 · Who Decided What Sorted Is | 4 · Why Look When You Can Just Do | 5 · Where the Thread Ends | 6 · Sieve Sort | 7 · The Pendry Sort Paper


In Python, Safe-Slot Sort beat sorted() by 2–3x on nearly-sorted data. Useful, but modest. Then I ported it to C.

18x faster than qsort. Wins up to 70% corruption. 50 lines.

This post is the story of what happened when Python's overhead disappeared, what broke when adversarial inputs fought back, how the algorithm evolved through four versions to handle anything thrown at it, and what three tenets crystallized out of the whole process. Two additional algorithms emerge here: Micro SSS, the tenet-distilled minimal core, and CAS-Core, the same algorithm recognized as insertion sort. The benchmark data tells us which situations each one actually wins.

Three algorithms: This post introduces three algorithms that end up being the same core with different wrapping. SafeSlot-C uses windowed qsort fallback. MicroSSS and CAS-Core are pure insertion sort with reverse detection. The benchmark treats them separately because the windowing changes real-world behavior.

The Port

The C implementation is straightforward. Scan for inversions, build a mask around each one, collect the masked elements, sort them, put them back. Same algorithm as Python, minus the interpreter.

// Copyright 2026 HalfHuman Draft - Pendry, S
// Licensed under the Apache License, Version 2.0

#include <stdint.h>
#include <stdlib.h>

void safe_slot_sort(int64_t* arr, size_t n) {
    if (n <= 1) return;

    // Presort check: sample up to 150 adjacent pairs
    int desc = 0, samp = 0;
    size_t step = n / 150;
    if (step < 1) step = 1;
    for (size_t i = 0; i + 1 < n && samp < 150; i += step) {
        if (arr[i] > arr[i + 1]) desc++;
        samp++;
    }
    // Mostly descending: reverse in place first
    if (samp > 0 && (double)desc / samp > 0.8)
        for (size_t i = 0; i < n / 2; i++) {
            int64_t t = arr[i]; arr[i] = arr[n-1-i];
            arr[n-1-i] = t;
        }

    // Windowed inversion repair (100 iteration cap)
    for (int iteration = 0; iteration < 100; iteration++) {
        // ... scan, mask, collect, qsort, place back
    }
}

Same idea. Completely different numbers.
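The elided repair loop can be sketched along the lines the comment describes. This is a minimal sketch, not the original implementation: the window radius (`WINDOW = 8`), the helper names (`repair_pass`, `windowed_sort`, `cmp_i64`), and the allocation strategy are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Comparator for the qsort fallback.
static int cmp_i64(const void* a, const void* b) {
    int64_t x = *(const int64_t*)a, y = *(const int64_t*)b;
    return (x > y) - (x < y);
}

// One repair iteration: scan for inversions, mask a window around each,
// collect the masked elements, qsort them, place them back in mask order.
// Returns the number of inversions found (0 means the array is sorted).
static size_t repair_pass(int64_t* arr, unsigned char* mask,
                          int64_t* buf, size_t n) {
    const size_t WINDOW = 8;                      // assumed window radius
    size_t inversions = 0, m = 0;
    memset(mask, 0, n);
    for (size_t i = 0; i + 1 < n; i++) {
        if (arr[i] > arr[i + 1]) {                // scan: inversion found
            inversions++;
            size_t lo = i > WINDOW ? i - WINDOW : 0;
            size_t hi = i + 1 + WINDOW < n ? i + 1 + WINDOW : n - 1;
            for (size_t j = lo; j <= hi; j++) mask[j] = 1;   // mask
        }
    }
    if (inversions == 0) return 0;
    for (size_t i = 0; i < n; i++)                // collect
        if (mask[i]) buf[m++] = arr[i];
    qsort(buf, m, sizeof(int64_t), cmp_i64);      // sort the collected run
    m = 0;
    for (size_t i = 0; i < n; i++)                // place back
        if (mask[i]) arr[i] = buf[m++];
    return inversions;
}

// Drive repair passes under the 100-iteration cap from the post.
// Returns 1 if the array converged to sorted, 0 if it hit the cap.
static int windowed_sort(int64_t* arr, size_t n) {
    unsigned char* mask = malloc(n);
    int64_t* buf = malloc(n * sizeof(int64_t));
    int ok = 0;
    for (int it = 0; it < 100 && !ok; it++)
        if (repair_pass(arr, mask, buf, n) == 0) ok = 1;
    free(mask);
    free(buf);
    return ok;
}
```

On a local swap the masked region is small and one pass fixes it; that is the whole reason the structured-data numbers look the way they do.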

The C Numbers

The original post benchmarked against qsort at 10M elements and showed 18.7x on 0.01% localized corruption. The unified benchmark at 1M elements against 12 algorithms gives us a fuller picture. On its home turf, SafeSlot-C is competitive. On sorted data, 1.5ms against qsort's 53.2ms gives a 35x advantage. On local swap patterns up to 10%, it ranges from 5.4ms to 7.5ms while qsort sits at 52–53ms. The gap is real.

But the unified benchmark also reveals the full cost of the approach. SafeSlot-C is tested on only 19 of 35 patterns. On the other 16, it either produces unsorted output or takes catastrophically long. That's worse than any other algorithm in the C benchmark except the two others that share its insertion-sort core.

Fairness note: The original 18.7x was against qsort at 10M. qsort has function pointer overhead per comparison. A hand-inlined introsort would be faster. The 35x at 1M against qsort in the unified benchmark reflects the same overhead. Against introsort directly, the advantage narrows but persists on structured data.

SafeSlot-C: Unified Benchmark at 1M Elements

Safe-Slot Sort in C. The windowed inversion repair approach with presort detection and 100-iteration cap. 19 of 35 patterns tested.

The Sweet Spot: Nearly-Sorted Data

Pattern SafeSlot-C qsort introsort Pendry
Sorted 1.5ms 53.2ms 16.2ms 1.6ms
Plateau 1.6ms 27.2ms 22.2ms 1.6ms
LocalSwap 0.1% 5.4ms 52.2ms 16.1ms 2.1ms
LocalSwap 1% 6.8ms 52.5ms 16.2ms 2.2ms
LocalSwap 10% 7.5ms 52.6ms 16.6ms 2.6ms
Localized 0.1% 4.9ms 52.3ms 16.2ms 1.7ms
Scatter 0.1% 6.0ms 52.3ms 16.2ms 2.2ms

The Cliff: Increasing Disorder

Pattern SafeSlot-C qsort Result
Scatter 5% 10.3ms 55.8ms Still winning 5.4x
Scatter 25% 10.3ms 55.4ms Still winning 5.4x
Scatter 100% 10.1ms 55.0ms Still winning 5.4x
Localized 25% 50.2ms 78.6ms Winning 1.6x

Note: Scatter patterns above 5% appear flat (~10ms) because the windowed repair with qsort fallback converges quickly. The repair cost scales with window overlap, not raw disorder percentage.

What It Can't Do (N/A Patterns)

Pattern Reason
Reverse, Random, Pipe Organ, Sawtooth, Two Blocks, Outlier, Organ+Noise Unsorted output or non-convergence within 100 iterations
Heavy Dupes, Narrow Range, Binary, Ternary, Gaussian, Zipf, Wide, Huge, Clustered Shuffled-value patterns trigger quadratic behavior

When It Broke

The adversarial patterns from the Python version still failed in C. Faster failure is still failure.

Pattern Result
Single element displaced across array NOT SORTED after 100 iterations
Circular shift by 100 NOT SORTED after 100 iterations
Sawtooth (tooth=1000) NOT SORTED, 9x slower
1000-element reversed chunks NOT SORTED, 8.7x slower

The diffusion problem is language-independent. An element that needs to travel 10 million positions at 8 positions per iteration will never get home in 100 iterations.

Key lesson: The diffusion problem is fundamental to any windowed repair strategy. The window bounds how far an element can travel per iteration. This limitation eventually led to Pendry Sort's multi-path architecture, where the window is just one of several tools.
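The arithmetic behind the diffusion argument is easy to make concrete. A short sketch, where `min_iterations` and `single_displaced` are illustrative helpers (not from the original code), using the 8-positions-per-iteration figure quoted above:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Lower bound on repair iterations for one element displaced `dist`
// positions when each iteration moves it at most `per_iter` positions.
static uint64_t min_iterations(uint64_t dist, uint64_t per_iter) {
    return (dist + per_iter - 1) / per_iter;     // ceiling division
}

// Build the "single element displaced" adversarial pattern: a sorted
// array with the smallest value moved to the far end.
static void single_displaced(int64_t* a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) a[i] = (int64_t)(i + 1);
    a[n - 1] = 0;                                // belongs at index 0
}
```

At 10 million positions and 8 per iteration, the lower bound is 1,250,000 iterations against a cap of 100. No constant-factor speedup closes that gap.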

The Evolution

The algorithm went through four versions to handle adversarial inputs. Each version made different tradeoffs.

Version 1: Bailout

Detect when struggling, fall back to qsort. Three heuristics: stall detection (inversions not decreasing), grind detection (tiny ineffective fixes), and coverage detection (fixing 60%+ of the array means just sort everything). Result: every adversarial pattern now passes. Maximum slowdown dropped from 9x to 1.74x over qsort.
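The three heuristics can be sketched as a single decision function. The specific thresholds below (1% progress for grind detection, the exact coverage fraction encoding) are illustrative assumptions; only the 60% coverage figure comes from the post.

```c
#include <assert.h>
#include <stddef.h>

// Bailout decision combining the three heuristics. Nonzero means:
// give up on windowed repair and fall back to a full qsort.
static int should_bail(size_t inv_now, size_t inv_prev,
                       size_t fixed_this_iter, size_t masked, size_t n) {
    if (inv_now >= inv_prev) return 1;                  // stall: not decreasing
    if (fixed_this_iter < inv_prev / 100) return 1;     // grind: tiny fixes
    if (masked * 10 >= n * 6) return 1;                 // coverage: >= 60% masked
    return 0;
}
```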

Version 2: Adaptive

Instead of giving up on hard patterns, detect the pattern and adapt. Far displacement gets gap-sized windows. Periodic patterns get spacing-matched windows. Stalling triggers window tripling. Added hierarchical sectioning: after the first full scan, only rescan "dirty" sections.
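The adaptation rules can be sketched as a window-sizing function. The function name and base window are illustrative; only the gap-sized, spacing-matched, and tripling rules come from the description above.

```c
#include <assert.h>
#include <stddef.h>

// Pick the next repair-window size from what the last scan observed.
static size_t adapt_window(size_t window, size_t max_gap,
                           size_t period, int stalled) {
    if (max_gap > window) window = max_gap;     // far displacement: gap-sized
    if (period > window) window = period;       // periodic: match the spacing
    if (stalled) window *= 3;                   // stalling: triple the window
    return window;
}
```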

Version 3: Lean

Memory and overhead optimization. Lazy allocation for the gaps array (only when far displacement is detected). Range-tracked mask clearing instead of full memset per iteration. Total memory at 1M elements dropped from ~25 MB to ~17 MB.
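Both optimizations are small enough to sketch directly. `clear_mask_range` and `gaps_buffer` are hypothetical names for what the description implies:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Range-tracked mask clearing: instead of a full memset per iteration,
// the caller tracks the lowest and highest masked index and clears
// only that span.
static void clear_mask_range(unsigned char* mask, size_t lo, size_t hi) {
    if (lo <= hi) memset(mask + lo, 0, hi - lo + 1);
}

// Lazy allocation for the gaps array: allocate only the first time
// far displacement is actually detected.
static size_t* gaps_buffer(size_t** gaps, size_t n) {
    if (*gaps == NULL) *gaps = malloc(n * sizeof(size_t));
    return *gaps;
}
```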

Version Lines Adversarial Structured Philosophy
Base ~50 Fails 18x faster Simple
Bailout ~100 Bails to qsort 18x faster Pragmatic
Adaptive v1 ~140 1.7–3x slower 3–4x faster Adaptive
Adaptive v2 ~220 1.3–1.7x slower 4x+ faster Complete
Lean (v3) ~220 1.3–1.7x slower 4x+ faster Efficient

The Minimal Core

Following the tenets to their logical conclusion produces something that already exists. That's not a cop-out. It's the point.

// 7 lines. The tenets, implemented.
void sort(int* a, int n) {
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i+1]) {
            int v = a[i+1], j = i;
            while (j >= 0 && a[j] > v) { a[j+1] = a[j]; j--; }
            a[j+1] = v;
        }
}

It's insertion sort. Walk left to right. If the current element is in order, move on: structure exists, exploit it by doing nothing. If it's not in order, slide it backward to its correct position: fix locally. On nearly-sorted data, almost everything hits the continue path and the whole thing runs in O(n). On random data, you get O(n²). Work scales with disorder.
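One way to watch "work scales with disorder" directly is to instrument the core with a shift counter (`sort_count_shifts` is a hypothetical helper, not from the original):

```c
#include <assert.h>

// The minimal core, counting element shifts. Sorted input costs zero
// shifts; shifts grow with how many elements are out of place and how
// far each one has to travel.
static long sort_count_shifts(int* a, int n) {
    long shifts = 0;
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1]) {
            int v = a[i + 1], j = i;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; shifts++; }
            a[j + 1] = v;
        }
    return shifts;
}
```

A sorted array costs 0 shifts, one adjacent swap costs 1, and a fully reversed array of n elements costs n(n-1)/2: the cost is a function of disorder, not size.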

The practical version, Micro SSS, adds a 50-sample reverse detection check. Twenty lines. No malloc. On local adjacent swaps it never loses to qsort at any array size. But on scattered random displacement, it goes from 15ms to catastrophe at just 1% corruption.
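A sketch of Micro SSS from that description: a 50-sample reverse check in front of the insertion-sort core, no heap allocation. The 80% descending threshold and the sampling stride are assumptions mirroring the presort check in safe_slot_sort, not the original constants.

```c
#include <assert.h>

// Micro SSS, sketched: reverse detection plus the insertion-sort core.
static void micro_sss(int* a, int n) {
    if (n < 2) return;
    // Reverse detection: sample up to 50 adjacent pairs.
    int desc = 0, samp = 0, step = n / 50;
    if (step < 1) step = 1;
    for (int i = 0; i + 1 < n && samp < 50; i += step) {
        if (a[i] > a[i + 1]) desc++;
        samp++;
    }
    if (desc * 10 > samp * 8)                   // mostly descending: reverse
        for (int i = 0, j = n - 1; i < j; i++, j--) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    // Insertion-sort core: fix each inversion locally.
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1]) {
            int v = a[i + 1], j = i;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
}
```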

Nomenclature: Micro SSS and CAS-Core are the same algorithm. CAS (Consequence-as-Action Sort) gets its own name and philosophical framing in Post 4. In the benchmark they're listed separately because they were developed at different points in the series.

Micro SSS / CAS-Core: The Tenet Sorts

The insertion sort core with reverse detection. Tested on 9 of 35 patterns at 1M elements (N/A on everything that goes quadratic).

Where the Tenets Dominate

Pattern MicroSSS CAS-Core qsort introsort Pendry
Sorted 1.5ms 1.3ms 53.2ms 16.2ms 1.6ms
All Identical 1.5ms 1.3ms 4.7ms 28.7ms 1.6ms
Plateau 1.5ms 1.3ms 27.2ms 22.2ms 1.6ms
LocalSwap 0.1% 1.5ms 1.3ms 52.2ms 16.1ms 2.1ms
LocalSwap 1% 1.6ms 1.4ms 52.5ms 16.2ms 2.2ms
LocalSwap 5% 1.8ms 1.6ms 52.6ms 16.5ms 2.5ms
LocalSwap 10% 1.9ms 1.7ms 52.6ms 16.6ms 2.6ms
Localized 0.1% 1.9ms 1.7ms 52.3ms 16.2ms 1.7ms
Localized 1% 43.7ms 43.6ms 52.9ms 16.7ms 2.4ms

CAS-Core takes 7 wins in the benchmark. On sorted, plateau, and all-identical data it runs at 1.3ms: on sorted input that is 40x faster than qsort and 12x faster than introsort. But Localized 1% reveals the cliff: 43.6ms, more than 25x the cost of 0.1%, because the corruption region exceeds what insertion sort handles efficiently.

The Tenets

Somewhere during all this iteration, the principles underneath the algorithm became clearer than the algorithm itself.

  1. Structure exists; exploit it. Don't assume your data is random. Most data in real systems isn't. Arrays that were sorted yesterday and had a few updates overnight are mostly sorted. If you treat all input as maximally chaotic, you're doing work that doesn't need doing.
  2. Find inversions; fix locally. When something is broken, fix that thing. Don't tear the whole structure down and rebuild it. If one book is out of place on a shelf, you don't dump every book on the floor and re-shelve them alphabetically.
  3. Work scales with disorder. A nearly-sorted array should take nearly no work. A fully random array should take full work. The algorithm's effort should be a function of how much is actually wrong, not how much data exists.

These aren't about sorting. They're about a philosophy of repair. Don't assume everything is broken. Find what actually is. Fix it locally. Let the cost of fixing reflect the cost of what broke.

The sorting algorithm is just where I happened to test it first. But by this point, the tenets had started suggesting something else entirely. What if comparing wasn't the only way to sort? What if the data already knew where it belonged, and we just had to listen?

Beyond sorting: The tenets apply to manufacturing (find drifted stations, don't rebuild the line), data validation (find bad records, don't re-validate everything), and mathematical frameworks (find paradoxes, don't abandon the axiom system). They are design principles, not sorting principles.

Classification

Three algorithms from one post, ranked against all 12 in the unified benchmark:

Algorithm Total (tested) Wins Tested Ranking Best Case Worst Case
SafeSlot-C 194,628 µs 1 19/35 #4 of 12 O(n) sorted Unsorted output
CAS-Core 55,067 µs 7 9/35 #10 of 12 O(n) sorted O(n²) random
MicroSSS 56,675 µs 0 9/35 #11 of 12 O(n) sorted O(n²) random

CAS-Core and MicroSSS have excellent per-test times but are only testable on 9 of 35 patterns. Their ~55ms totals across 9 patterns are misleading without that context. SafeSlot-C handles more patterns (19) but pays for it in overhead from the windowing and qsort fallback machinery.

The honest takeaway: All three excel on exactly one class of input: data that is already mostly in order with local perturbations. For anything else, they need help. The rest of this series builds that help.


Previously: Disorder Clusters — The original insight, Python implementation, and its limits.

Next: Who Decided What Sorted Is? — What happens when you stop comparing and start counting.