The C Revelation
Sorting · Part 2 of 7
What happened when Python overhead disappeared and the real algorithm showed up. (SafeSlot-C: #4 of 12, MicroSSS: #11 of 12, CAS-Core: #10 of 12 in the unified benchmark)
Series: 1 · Disorder Clusters | 2 · The C Revelation | 3 · Who Decided What Sorted Is | 4 · Why Look When You Can Just Do | 5 · Where the Thread Ends | 6 · Sieve Sort | 7 · The Pendry Sort Paper
In Python, Safe-Slot Sort beat sorted() by 2–3x on nearly-sorted data. Useful, but modest. Then I ported it to C.
18x faster than qsort. Wins up to 70% corruption. 50 lines.
This post is the story of what happened when Python's overhead disappeared, what broke when adversarial inputs fought back, how the algorithm evolved through four versions to handle anything thrown at it, and which three tenets crystallized out of the whole process. Two additional algorithms emerge here: Micro SSS, the tenet-distilled minimal core, and CAS-Core, the same algorithm recognized as insertion sort. The benchmark data shows where each one actually wins.
Three algorithms: This post introduces three algorithms that end up being the same core with different wrapping. SafeSlot-C uses windowed qsort fallback. MicroSSS and CAS-Core are pure insertion sort with reverse detection. The benchmark treats them separately because the windowing changes real-world behavior.
The Port
The C implementation is straightforward. Scan for inversions, build a mask around each one, collect the masked elements, sort them, put them back. Same algorithm as Python, minus the interpreter.
```c
// Copyright 2026 HalfHuman Draft - Pendry, S
// Licensed under the Apache License, Version 2.0
void safe_slot_sort(int64_t* arr, size_t n) {
    if (n <= 1) return;

    // Presort check
    int desc = 0, samp = 0, step = (int)(n / 150);
    if (step < 1) step = 1;
    for (size_t i = 0; i < n - 1 && samp < 150; i += step) {
        if (arr[i] > arr[i + 1]) desc++;
        samp++;
    }
    if (samp > 0 && (double)desc / samp > 0.8)
        for (size_t i = 0; i < n / 2; i++) {
            int64_t t = arr[i]; arr[i] = arr[n-1-i];
            arr[n-1-i] = t;
        }

    // Windowed inversion repair (100 iteration cap)
    for (int iteration = 0; iteration < 100; iteration++) {
        // ... scan, mask, collect, qsort, place back
    }
}
```
Same idea. Completely different numbers.
The C Numbers
The original post benchmarked against qsort at 10M elements and showed 18.7x on 0.01% localized corruption. The unified benchmark at 1M elements against 12 algorithms gives us a fuller picture. On its home turf, SafeSlot-C is competitive. On sorted data, 1.5ms against qsort's 53.2ms gives a 35x advantage. On local swap patterns up to 10%, it ranges from 5.4ms to 7.5ms while qsort sits at 52–53ms. The gap is real.
But the unified benchmark also reveals the full cost of the approach. SafeSlot-C is tested on only 19 of 35 patterns. On the other 16, it either produces unsorted output or takes catastrophically long. That's worse than any other algorithm in the C benchmark except the three that share its insertion-sort core.
Fairness note: The original 18.7x was against qsort at 10M. qsort has function pointer overhead per comparison. A hand-inlined introsort would be faster. The 35x at 1M against qsort in the unified benchmark reflects the same overhead. Against introsort directly, the advantage narrows but persists on structured data.
SafeSlot-C: Unified Benchmark at 1M Elements
Safe-Slot Sort in C. The windowed inversion repair approach with presort detection and 100-iteration cap. 19 of 35 patterns tested.
The Sweet Spot: Nearly-Sorted Data
| Pattern | SafeSlot-C | qsort | introsort | Pendry |
|---|---|---|---|---|
| Sorted | 1.5ms | 53.2ms | 16.2ms | 1.6ms |
| Plateau | 1.6ms | 27.2ms | 22.2ms | 1.6ms |
| LocalSwap 0.1% | 5.4ms | 52.2ms | 16.1ms | 2.1ms |
| LocalSwap 1% | 6.8ms | 52.5ms | 16.2ms | 2.2ms |
| LocalSwap 10% | 7.5ms | 52.6ms | 16.6ms | 2.6ms |
| Localized 0.1% | 4.9ms | 52.3ms | 16.2ms | 1.7ms |
| Scatter 0.1% | 6.0ms | 52.3ms | 16.2ms | 2.2ms |
The Cliff: Increasing Disorder
| Pattern | SafeSlot-C | qsort | Result |
|---|---|---|---|
| Scatter 5% | 10.3ms | 55.8ms | Still winning 5.4x |
| Scatter 25% | 10.3ms | 55.4ms | Still winning 5.4x |
| Scatter 100% | 10.1ms | 55.0ms | Still winning 5.4x |
| Localized 25% | 50.2ms | 78.6ms | Winning 1.6x |
Note: Scatter patterns above 5% appear flat (~10ms) because the windowed repair with qsort fallback converges quickly. The repair cost scales with window overlap, not raw disorder percentage.
What It Can't Do (N/A Patterns)
| Pattern | Reason |
|---|---|
| Reverse, Random, Pipe Organ, Sawtooth, Two Blocks, Outlier, Organ+Noise | Unsorted output or non-convergence within 100 iterations |
| Heavy Dupes, Narrow Range, Binary, Ternary, Gaussian, Zipf, Wide, Huge, Clustered | Shuffled-value patterns trigger quadratic behavior |
When It Broke
The adversarial patterns from the Python version still failed in C. Faster failure is still failure.
| Pattern | Result |
|---|---|
| Single element displaced across array | NOT SORTED after 100 iterations |
| Circular shift by 100 | NOT SORTED after 100 iterations |
| Sawtooth (tooth=1000) | NOT SORTED, 9x slower |
| 1000-element reversed chunks | NOT SORTED, 8.7x slower |
The diffusion problem is language-independent. An element that needs to travel 10 million positions at 8 positions per iteration will never get home in 100 iterations.
Key lesson: The diffusion problem is fundamental to any windowed repair strategy. The window bounds how far an element can travel per iteration. This limitation eventually led to Pendry Sort's multi-path architecture, where the window is just one of several tools.
The Evolution
The algorithm went through four versions to handle adversarial inputs. Each version made different tradeoffs.
Version 1: Bailout
Detect when struggling, fall back to qsort. Three heuristics: stall detection (inversions not decreasing), grind detection (tiny ineffective fixes), and coverage detection (fixing 60%+ of the array means just sort everything). Result: every adversarial pattern now passes. Maximum slowdown dropped from 9x to 1.74x over qsort.
Version 2: Adaptive
Instead of giving up on hard patterns, detect the pattern and adapt. Far displacement gets gap-sized windows. Periodic patterns get spacing-matched windows. Stalling triggers window tripling. Added hierarchical sectioning: after the first full scan, only rescan "dirty" sections.
Version 3: Lean
Memory and overhead optimization. Lazy allocation for the gaps array (only when far displacement is detected). Range-tracked mask clearing instead of full memset per iteration. Total memory at 1M elements dropped from ~25 MB to ~17 MB.
| Version | Lines | Adversarial | Structured | Philosophy |
|---|---|---|---|---|
| Base | ~50 | Fails | 18x faster | Simple |
| Bailout | ~100 | Bails to qsort | 18x faster | Pragmatic |
| Adaptive v1 | ~140 | 1.7–3x slower | 3–4x faster | Adaptive |
| Adaptive v2 | ~220 | 1.3–1.7x slower | 4x+ faster | Complete |
| Lean (v3) | ~220 | 1.3–1.7x slower | 4x+ faster | Efficient |
The Minimal Core
Following the tenets to their logical conclusion produces something that already exists. That's not a cop-out. It's the point.
```c
// 7 lines. The tenets, implemented.
void sort(int* a, int n) {
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i+1]) {
            int v = a[i+1], j = i;
            while (j >= 0 && a[j] > v) { a[j+1] = a[j]; j--; }
            a[j+1] = v;
        }
}
```
It's insertion sort. Walk left to right. If the current element is in order, move on: structure exists, exploit it by doing nothing. If it's not in order, slide it backward to its correct position: fix locally. On nearly-sorted data, almost everything hits the continue path and the whole thing runs in O(n). On random data, you get O(n²). Work scales with disorder.
The practical version, Micro SSS, adds a 50-sample reverse detection check. Twenty lines. No malloc. On local adjacent swaps it never loses to qsort at any array size. But on scattered random displacement, it goes from 15ms to catastrophe at just 1% corruption.
Nomenclature: Micro SSS and CAS-Core are the same algorithm. CAS (Consequence-as-Action Sort) gets its own name and philosophical framing in Post 4. In the benchmark they're listed separately because they were developed at different points in the series.
Micro SSS / CAS-Core: The Tenet Sorts
The insertion sort core with reverse detection. Tested on 9 of 35 patterns at 1M elements (N/A on everything that goes quadratic).
Where the Tenets Dominate
| Pattern | MicroSSS | CAS-Core | qsort | introsort | Pendry |
|---|---|---|---|---|---|
| Sorted | 1.5ms | 1.3ms | 53.2ms | 16.2ms | 1.6ms |
| All Identical | 1.5ms | 1.3ms | 4.7ms | 28.7ms | 1.6ms |
| Plateau | 1.5ms | 1.3ms | 27.2ms | 22.2ms | 1.6ms |
| LocalSwap 0.1% | 1.5ms | 1.3ms | 52.2ms | 16.1ms | 2.1ms |
| LocalSwap 1% | 1.6ms | 1.4ms | 52.5ms | 16.2ms | 2.2ms |
| LocalSwap 5% | 1.8ms | 1.6ms | 52.6ms | 16.5ms | 2.5ms |
| LocalSwap 10% | 1.9ms | 1.7ms | 52.6ms | 16.6ms | 2.6ms |
| Localized 0.1% | 1.9ms | 1.7ms | 52.3ms | 16.2ms | 1.7ms |
| Localized 1% | 43.7ms | 43.6ms | 52.9ms | 16.7ms | 2.4ms |
CAS-Core takes 7 wins in the benchmark. On sorted/plateau/identical data, it runs at 1.3ms, which is 40x faster than qsort and 12x faster than introsort. But Localized 1% reveals the cliff: 43.6ms, nearly 20x the cost of 0.1%, because the corruption region exceeds what insertion sort handles efficiently.
The Tenets
Somewhere during all this iteration, the principles underneath the algorithm became clearer than the algorithm itself.
- Structure exists; exploit it. Don't assume your data is random. Most data in real systems isn't. Arrays that were sorted yesterday and had a few updates overnight are mostly sorted. If you treat all input as maximally chaotic, you're doing work that doesn't need doing.
- Find inversions; fix locally. When something is broken, fix that thing. Don't tear the whole structure down and rebuild it. If one book is out of place on a shelf, you don't dump every book on the floor and re-shelve them alphabetically.
- Work scales with disorder. A nearly-sorted array should take nearly no work. A fully random array should take full work. The algorithm's effort should be a function of how much is actually wrong, not how much data exists.
These aren't about sorting. They're about a philosophy of repair. Don't assume everything is broken. Find what actually is. Fix it locally. Let the cost of fixing reflect the cost of what broke.
The sorting algorithm is just where I happened to test it first. But by this point, the tenets had started suggesting something else entirely. What if comparing wasn't the only way to sort? What if the data already knew where it belonged, and we just had to listen?
Beyond sorting: The tenets apply to manufacturing (find drifted stations, don't rebuild the line), data validation (find bad records, don't re-validate everything), and mathematical frameworks (find paradoxes, don't abandon the axiom system). They are design principles, not sorting principles.
Classification
Three algorithms from one post, ranked against all 12 in the unified benchmark:
| Algorithm | Total (tested) | Wins | Tested | Ranking | Best Case | Worst Case |
|---|---|---|---|---|---|---|
| SafeSlot-C | 194,628 µs | 1 | 19/35 | #4 of 12 | O(n) sorted | Unsorted output |
| CAS-Core | 55,067 µs | 7 | 9/35 | #10 of 12 | O(n) sorted | O(n²) random |
| MicroSSS | 56,675 µs | 0 | 9/35 | #11 of 12 | O(n) sorted | O(n²) random |
CAS-Core and MicroSSS have excellent per-test times but are only testable on 9 of 35 patterns. Their total of ~55ms across 9 patterns is misleading without that context. SafeSlot-C handles more patterns (19) but pays for it in overhead from the windowing and qsort fallback machinery.
The honest takeaway: All three excel on exactly one class of input: data that is already mostly in order with local perturbations. For anything else, they need help. The rest of this series builds that help.
Previously: Disorder Clusters — The original insight, Python implementation, and its limits.
Next: Who Decided What Sorted Is? — What happens when you stop comparing and start counting.