HalfHumanDraft

The C Revelation

Sorting · Part 2 of 7

What happened when Python overhead disappeared and the real algorithm showed up. (SafeSlot-C: #4 of 12, MicroSSS: #11 of 12, CAS-Core: #10 of 12 in the unified benchmark)

Series: 1 · Disorder Clusters | 2 · The C Revelation | 3 · Who Decided What Sorted Is | 4 · Why Look When You Can Just Do | 5 · Where the Thread Ends | 6 · Sieve Sort | 7 · The Pendry Sort Paper


In Python, Safe-Slot Sort beat sorted() by 2–3x on nearly-sorted data. Useful, but modest. Then I ported it to C.

18x faster than qsort. Wins up to 70% corruption. 50 lines.

This post is the story of what happened when Python's overhead disappeared, what broke when adversarial inputs fought back, how the algorithm evolved through four versions to handle anything thrown at it, and what three tenets crystallized out of the whole process. Two additional algorithms emerge here: Micro SSS, the tenet-distilled minimal core, and CAS-Core, the same algorithm recognized as insertion sort. The benchmark data tells us which situations each one actually wins.

Three algorithms: This post introduces three algorithms that end up being the same core with different wrapping. SafeSlot-C uses windowed qsort fallback. MicroSSS and CAS-Core are pure insertion sort with reverse detection. The benchmark treats them separately because the windowing changes real-world behavior.

The Port

The C implementation is straightforward. Scan for inversions, build a mask around each one, collect the masked elements, sort them, put them back. Same algorithm as Python, minus the interpreter.

// Copyright 2026 HalfHuman Draft - Pendry, S
// Licensed under the Apache License, Version 2.0

#include <stdint.h>
#include <stdlib.h>

void safe_slot_sort(int64_t* arr, size_t n) {
    if (n <= 1) return;

    // Presort check: sample up to 150 adjacent pairs
    int desc = 0, samp = 0;
    size_t step = n / 150;
    if (step < 1) step = 1;
    for (size_t i = 0; i + 1 < n && samp < 150; i += step) {
        if (arr[i] > arr[i + 1]) desc++;
        samp++;
    }
    // Mostly descending: reverse in place first
    if (samp > 0 && (double)desc / samp > 0.8)
        for (size_t i = 0; i < n / 2; i++) {
            int64_t t = arr[i]; arr[i] = arr[n-1-i];
            arr[n-1-i] = t;
        }

    // Windowed inversion repair (100 iteration cap)
    for (int iteration = 0; iteration < 100; iteration++) {
        // ... scan, mask, collect, qsort, place back
    }
}

Same idea. Completely different numbers.
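The elided repair loop can be sketched along the lines the comment describes. This is a minimal sketch, not the original implementation: the window radius (`WINDOW = 8`), the helper names (`repair_pass`, `windowed_sort`, `cmp_i64`), and the allocation strategy are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Comparator for the qsort fallback.
static int cmp_i64(const void* a, const void* b) {
    int64_t x = *(const int64_t*)a, y = *(const int64_t*)b;
    return (x > y) - (x < y);
}

// One repair iteration: scan for inversions, mask a window around each,
// collect the masked elements, qsort them, place them back in mask order.
// Returns the number of inversions found (0 means the array is sorted).
static size_t repair_pass(int64_t* arr, unsigned char* mask,
                          int64_t* buf, size_t n) {
    const size_t WINDOW = 8;                      // assumed window radius
    size_t inversions = 0, m = 0;
    memset(mask, 0, n);
    for (size_t i = 0; i + 1 < n; i++) {
        if (arr[i] > arr[i + 1]) {                // scan: inversion found
            inversions++;
            size_t lo = i > WINDOW ? i - WINDOW : 0;
            size_t hi = i + 1 + WINDOW < n ? i + 1 + WINDOW : n - 1;
            for (size_t j = lo; j <= hi; j++) mask[j] = 1;   // mask
        }
    }
    if (inversions == 0) return 0;
    for (size_t i = 0; i < n; i++)                // collect
        if (mask[i]) buf[m++] = arr[i];
    qsort(buf, m, sizeof(int64_t), cmp_i64);      // sort the collected run
    m = 0;
    for (size_t i = 0; i < n; i++)                // place back
        if (mask[i]) arr[i] = buf[m++];
    return inversions;
}

// Drive repair passes under the 100-iteration cap from the post.
// Returns 1 if the array converged to sorted, 0 if it hit the cap.
static int windowed_sort(int64_t* arr, size_t n) {
    unsigned char* mask = malloc(n);
    int64_t* buf = malloc(n * sizeof(int64_t));
    int ok = 0;
    for (int it = 0; it < 100 && !ok; it++)
        if (repair_pass(arr, mask, buf, n) == 0) ok = 1;
    free(mask);
    free(buf);
    return ok;
}
```

On a local swap the masked region is small and one pass fixes it; that is the whole reason the structured-data numbers look the way they do.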

The C Numbers

The original post benchmarked against qsort at 10M elements and showed 18.7x on 0.01% localized corruption. The unified benchmark at 1M elements against 12 algorithms gives us a fuller picture. On its home turf, SafeSlot-C is competitive. On sorted data, 1.5ms against qsort's 53.2ms gives a 35x advantage. On local swap patterns up to 10%, it ranges from 5.4ms to 7.5ms while qsort sits at 52–53ms. The gap is real.

But the unified benchmark also reveals the full cost of the approach. SafeSlot-C is tested on only 19 of 35 patterns. On the other 16, it either produces unsorted output or takes catastrophically long. That's worse than any other algorithm in the C benchmark except the two others that share its insertion-sort core.

Fairness note: The original 18.7x was against qsort at 10M. qsort has function pointer overhead per comparison. A hand-inlined introsort would be faster. The 35x at 1M against qsort in the unified benchmark reflects the same overhead. Against introsort directly, the advantage narrows but persists on structured data.

SafeSlot-C: Unified Benchmark at 1M Elements

Safe-Slot Sort in C. The windowed inversion repair approach with presort detection and 100-iteration cap. 19 of 35 patterns tested.

The Sweet Spot: Nearly-Sorted Data

Pattern SafeSlot-C qsort introsort Pendry
Sorted 1.5ms 53.2ms 16.2ms 1.6ms
Plateau 1.6ms 27.2ms 22.2ms 1.6ms
LocalSwap 0.1% 5.4ms 52.2ms 16.1ms 2.1ms
LocalSwap 1% 6.8ms 52.5ms 16.2ms 2.2ms
LocalSwap 10% 7.5ms 52.6ms 16.6ms 2.6ms
Localized 0.1% 4.9ms 52.3ms 16.2ms 1.7ms
Scatter 0.1% 6.0ms 52.3ms 16.2ms 2.2ms

The Cliff: Increasing Disorder

Pattern SafeSlot-C qsort Result
Scatter 5% 10.3ms 55.8ms Still winning 5.4x
Scatter 25% 10.3ms 55.4ms Still winning 5.4x
Scatter 100% 10.1ms 55.0ms Still winning 5.4x
Localized 25% 50.2ms 78.6ms Winning 1.6x

Note: Scatter patterns above 5% appear flat (~10ms) because the windowed repair with qsort fallback converges quickly. The repair cost scales with window overlap, not raw disorder percentage.

What It Can't Do (N/A Patterns)

Pattern Reason
Reverse, Random, Pipe Organ, Sawtooth, Two Blocks, Outlier, Organ+Noise Unsorted output or non-convergence within 100 iterations
Heavy Dupes, Narrow Range, Binary, Ternary, Gaussian, Zipf, Wide, Huge, Clustered Shuffled-value patterns trigger quadratic behavior

When It Broke

The adversarial patterns from the Python version still failed in C. Faster failure is still failure.

Pattern Result
Single element displaced across array NOT SORTED after 100 iterations
Circular shift by 100 NOT SORTED after 100 iterations
Sawtooth (tooth=1000) NOT SORTED, 9x slower
1000-element reversed chunks NOT SORTED, 8.7x slower

The diffusion problem is language-independent. An element that needs to travel 10 million positions at 8 positions per iteration will never get home in 100 iterations.

Key lesson: The diffusion problem is fundamental to any windowed repair strategy. The window bounds how far an element can travel per iteration. This limitation eventually led to Pendry Sort's multi-path architecture, where the window is just one of several tools.
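The arithmetic behind the diffusion argument is easy to make concrete. A short sketch, where `min_iterations` and `single_displaced` are illustrative helpers (not from the original code), using the 8-positions-per-iteration figure quoted above:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Lower bound on repair iterations for one element displaced `dist`
// positions when each iteration moves it at most `per_iter` positions.
static uint64_t min_iterations(uint64_t dist, uint64_t per_iter) {
    return (dist + per_iter - 1) / per_iter;     // ceiling division
}

// Build the "single element displaced" adversarial pattern: a sorted
// array with the smallest value moved to the far end.
static void single_displaced(int64_t* a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) a[i] = (int64_t)(i + 1);
    a[n - 1] = 0;                                // belongs at index 0
}
```

At 10 million positions and 8 per iteration, the lower bound is 1,250,000 iterations against a cap of 100. No constant-factor speedup closes that gap.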

The Evolution

The algorithm went through four versions to handle adversarial inputs. Each version made different tradeoffs.

Version 1: Bailout

Detect when struggling, fall back to qsort. Three heuristics: stall detection (inversions not decreasing), grind detection (tiny ineffective fixes), and coverage detection (fixing 60%+ of the array means just sort everything). Result: every adversarial pattern now passes. Maximum slowdown dropped from 9x to 1.74x over qsort.
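The three heuristics can be sketched as a single decision function. The specific thresholds below (1% progress for grind detection, the exact coverage fraction encoding) are illustrative assumptions; only the 60% coverage figure comes from the post.

```c
#include <assert.h>
#include <stddef.h>

// Bailout decision combining the three heuristics. Nonzero means:
// give up on windowed repair and fall back to a full qsort.
static int should_bail(size_t inv_now, size_t inv_prev,
                       size_t fixed_this_iter, size_t masked, size_t n) {
    if (inv_now >= inv_prev) return 1;                  // stall: not decreasing
    if (fixed_this_iter < inv_prev / 100) return 1;     // grind: tiny fixes
    if (masked * 10 >= n * 6) return 1;                 // coverage: >= 60% masked
    return 0;
}
```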

Version 2: Adaptive

Instead of giving up on hard patterns, detect the pattern and adapt. Far displacement gets gap-sized windows. Periodic patterns get spacing-matched windows. Stalling triggers window tripling. Added hierarchical sectioning: after the first full scan, only rescan "dirty" sections.
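The adaptation rules can be sketched as a window-sizing function. The function name and base window are illustrative; only the gap-sized, spacing-matched, and tripling rules come from the description above.

```c
#include <assert.h>
#include <stddef.h>

// Pick the next repair-window size from what the last scan observed.
static size_t adapt_window(size_t window, size_t max_gap,
                           size_t period, int stalled) {
    if (max_gap > window) window = max_gap;     // far displacement: gap-sized
    if (period > window) window = period;       // periodic: match the spacing
    if (stalled) window *= 3;                   // stalling: triple the window
    return window;
}
```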

Version 3: Lean

Memory and overhead optimization. Lazy allocation for the gaps array (only when far displacement is detected). Range-tracked mask clearing instead of full memset per iteration. Total memory at 1M elements dropped from ~25 MB to ~17 MB.
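Both optimizations are small enough to sketch directly. `clear_mask_range` and `gaps_buffer` are hypothetical names for what the description implies:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Range-tracked mask clearing: instead of a full memset per iteration,
// the caller tracks the lowest and highest masked index and clears
// only that span.
static void clear_mask_range(unsigned char* mask, size_t lo, size_t hi) {
    if (lo <= hi) memset(mask + lo, 0, hi - lo + 1);
}

// Lazy allocation for the gaps array: allocate only the first time
// far displacement is actually detected.
static size_t* gaps_buffer(size_t** gaps, size_t n) {
    if (*gaps == NULL) *gaps = malloc(n * sizeof(size_t));
    return *gaps;
}
```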

Version Lines Adversarial Structured Philosophy
Base ~50 Fails 18x faster Simple
Bailout ~100 Bails to qsort 18x faster Pragmatic
Adaptive v1 ~140 1.7–3x slower 3–4x faster Adaptive
Adaptive v2 ~220 1.3–1.7x slower 4x+ faster Complete
Lean (v3) ~220 1.3–1.7x slower 4x+ faster Efficient

The Minimal Core

Following the tenets to their logical conclusion produces something that already exists. That's not a cop-out. It's the point.

// 7 lines. The tenets, implemented.
void sort(int* a, int n) {
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i+1]) {
            int v = a[i+1], j = i;
            while (j >= 0 && a[j] > v) { a[j+1] = a[j]; j--; }
            a[j+1] = v;
        }
}

It's insertion sort. Walk left to right. If the current element is in order, move on: structure exists, exploit it by doing nothing. If it's not in order, slide it backward to its correct position: fix locally. On nearly-sorted data, almost everything hits the continue path and the whole thing runs in O(n). On random data, you get O(n²). Work scales with disorder.
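One way to watch "work scales with disorder" directly is to instrument the core with a shift counter (`sort_count_shifts` is a hypothetical helper, not from the original):

```c
#include <assert.h>

// The minimal core, counting element shifts. Sorted input costs zero
// shifts; shifts grow with how many elements are out of place and how
// far each one has to travel.
static long sort_count_shifts(int* a, int n) {
    long shifts = 0;
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1]) {
            int v = a[i + 1], j = i;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; shifts++; }
            a[j + 1] = v;
        }
    return shifts;
}
```

A sorted array costs 0 shifts, one adjacent swap costs 1, and a fully reversed array of n elements costs n(n-1)/2: the cost is a function of disorder, not size.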

The practical version, Micro SSS, adds a 50-sample reverse detection check. Twenty lines. No malloc. On local adjacent swaps it never loses to qsort at any array size. But on scattered random displacement, it goes from 15ms to catastrophe at just 1% corruption.
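A sketch of Micro SSS from that description: a 50-sample reverse check in front of the insertion-sort core, no heap allocation. The 80% descending threshold and the sampling stride are assumptions mirroring the presort check in safe_slot_sort, not the original constants.

```c
#include <assert.h>

// Micro SSS, sketched: reverse detection plus the insertion-sort core.
static void micro_sss(int* a, int n) {
    if (n < 2) return;
    // Reverse detection: sample up to 50 adjacent pairs.
    int desc = 0, samp = 0, step = n / 50;
    if (step < 1) step = 1;
    for (int i = 0; i + 1 < n && samp < 50; i += step) {
        if (a[i] > a[i + 1]) desc++;
        samp++;
    }
    if (desc * 10 > samp * 8)                   // mostly descending: reverse
        for (int i = 0, j = n - 1; i < j; i++, j--) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    // Insertion-sort core: fix each inversion locally.
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1]) {
            int v = a[i + 1], j = i;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
}
```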

Nomenclature: Micro SSS and CAS-Core are the same algorithm. CAS (Consequence-as-Action Sort) gets its own name and philosophical framing in Post 4. In the benchmark they're listed separately because they were developed at different points in the series.

Micro SSS / CAS-Core: The Tenet Sorts

The insertion sort core with reverse detection. Tested on 9 of 35 patterns at 1M elements (N/A on everything that goes quadratic).

Where the Tenets Dominate

Pattern MicroSSS CAS-Core qsort introsort Pendry
Sorted 1.5ms 1.3ms 53.2ms 16.2ms 1.6ms
All Identical 1.5ms 1.3ms 4.7ms 28.7ms 1.6ms
Plateau 1.5ms 1.3ms 27.2ms 22.2ms 1.6ms
LocalSwap 0.1% 1.5ms 1.3ms 52.2ms 16.1ms 2.1ms
LocalSwap 1% 1.6ms 1.4ms 52.5ms 16.2ms 2.2ms
LocalSwap 5% 1.8ms 1.6ms 52.6ms 16.5ms 2.5ms
LocalSwap 10% 1.9ms 1.7ms 52.6ms 16.6ms 2.6ms
Localized 0.1% 1.9ms 1.7ms 52.3ms 16.2ms 1.7ms
Localized 1% 43.7ms 43.6ms 52.9ms 16.7ms 2.4ms

CAS-Core takes 7 wins in the benchmark. On sorted, plateau, and all-identical data it runs at 1.3ms: on sorted input that is 40x faster than qsort and 12x faster than introsort. But Localized 1% reveals the cliff: 43.6ms, more than 25x the cost of 0.1%, because the corruption region exceeds what insertion sort handles efficiently.

The Tenets

Somewhere during all this iteration, the principles underneath the algorithm became clearer than the algorithm itself.

  1. Structure exists; exploit it. Don't assume your data is random. Most data in real systems isn't. Arrays that were sorted yesterday and had a few updates overnight are mostly sorted. If you treat all input as maximally chaotic, you're doing work that doesn't need doing.
  2. Find inversions; fix locally. When something is broken, fix that thing. Don't tear the whole structure down and rebuild it. If one book is out of place on a shelf, you don't dump every book on the floor and re-shelve them alphabetically.
  3. Work scales with disorder. A nearly-sorted array should take nearly no work. A fully random array should take full work. The algorithm's effort should be a function of how much is actually wrong, not how much data exists.

These aren't about sorting. They're about a philosophy of repair. Don't assume everything is broken. Find what actually is. Fix it locally. Let the cost of fixing reflect the cost of what broke.

The sorting algorithm is just where I happened to test it first. But by this point, the tenets had started suggesting something else entirely. What if comparing wasn't the only way to sort? What if the data already knew where it belonged, and we just had to listen?

Beyond sorting: The tenets apply to manufacturing (find drifted stations, don't rebuild the line), data validation (find bad records, don't re-validate everything), and mathematical frameworks (find paradoxes, don't abandon the axiom system). They are design principles, not sorting principles.

Classification

Three algorithms from one post, ranked against all 12 in the unified benchmark:

Algorithm Total (tested) Wins Tested Ranking Best Case Worst Case
SafeSlot-C 194,628 µs 1 19/35 #4 of 12 O(n) sorted Unsorted output
CAS-Core 55,067 µs 7 9/35 #10 of 12 O(n) sorted O(n²) random
MicroSSS 56,675 µs 0 9/35 #11 of 12 O(n) sorted O(n²) random

CAS-Core and MicroSSS have excellent per-test times but are only testable on 9 of 35 patterns. Their ~55ms totals across 9 patterns are misleading without that context. SafeSlot-C handles more patterns (19) but pays for it in overhead from the windowing and qsort fallback machinery.

The honest takeaway: All three excel on exactly one class of input: data that is already mostly in order with local perturbations. For anything else, they need help. The rest of this series builds that help.


Previously: Disorder Clusters — The original insight, Python implementation, and its limits.

Next: Who Decided What Sorted Is? — What happens when you stop comparing and start counting.