DialSort: Non-Comparative Integer Sorting via the Self-Indexing Principle: Architecture, Implementation, and Substrate-Aware Analysis

Alexander Narvaez

arxiv: 2605.16667 · v1 · pith:SU5BSH3Pnew · submitted 2026-05-15 · 💻 cs.DS

DialSort: Non-Comparative Integer Sorting via the Self-Indexing Principle: Architecture, Implementation, and Substrate-Aware Analysis

Alexander Narvaez This is my paper

Pith reviewed 2026-05-19 20:29 UTC · model grok-4.3

classification 💻 cs.DS

keywords integer sortingnon-comparative sortcounting sortself-indexingparallel sortingbounded universehistogramconflict resolution

0 comments

The pith

DialSort sorts bounded-universe integers by treating the histogram of counts as the final ordered sequence, eliminating the prefix-sum step through direct key indexing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a non-comparative method for sorting non-negative integers drawn from a known small range. It shows that each key can point straight to its place in an array of counts, so that array itself becomes the sorted result once all keys have been tallied. This removes the usual extra pass that adds up counts to find starting positions. The author supplies proofs that the sequential cost stays linear in the number of keys plus the range size and that a parallel version scales with the number of threads plus a logarithmic term. A reader cares because the change turns a common special case of sorting into a simpler memory read rather than a series of rearrangements.

Core claim

DialSort eliminates the prefix-sum pass entirely by treating the histogram H as the canonical ordered representation, not as an intermediate structure. Each integer key simultaneously encodes its value and its canonical position in the ordered address space [0, U-1]. To support parallel ingestion without serialization, the Conflict Resolution Network resolves concurrent writes using equality checks exclusively, with no magnitude comparisons. Formal proofs establish O(n+U) sequential and O(n/k + log k + U) parallel time bounds.

What carries the argument

the self-indexing principle, under which each key value maps directly to a fixed index in the histogram array that then serves as the sorted output

If this is right

Sequential execution finishes in O(n + U) time for n keys drawn from a universe of size U.
A k-thread parallel version finishes in O(n/k + log k + U) time.
A prototype on 8-thread x86 hardware reaches 39.77 times the speed of std::sort for suitable inputs.
The method beats classic counting sort in 46 of 48 tested configurations and requires no auxiliary scatter buffer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Range-compression steps could be added upstream so that inputs with unknown or large ranges still fit the bounded-U premise.
The pipelined additive reduction tree used for conflict resolution may apply to other concurrent counting or aggregation tasks.
Hardware designs that accelerate equality-checked writes could raise the observed throughput beyond the current 115.9 million keys per second.
The same direct-mapping idea might shorten passes in related bounded-universe tasks such as frequency tables or bucket partitioning.

Load-bearing premise

All input keys are non-negative integers that lie inside a known upper bound U small enough to allocate a histogram array of size U.

What would settle it

Run the algorithm on the input [3, 1, 4, 1, 5] with declared U equal to 6 and check whether the histogram array ends up containing the values in ascending order with no separate prefix-sum calculation performed.

Figures

Figures reproduced from arXiv: 2605.16667 by Alexander Narvaez.

**Figure 1.** Figure 1: DialSort conceptual pipeline with CRN integrated. Input lanes ingest key-count pairs in parallel. The CRN merges equal keys additively using equality detection only (= check; no < or >), producing conflict-free updates to histogram memory H[k]. H[·] is the canonical ordered state; the geometric scan and output vector are derived interface operations, not the core ordering mechanism. strongest gains when n … view at source ↗

**Figure 2.** Figure 2: Cycle-accurate DialSort visualization for input K = {3, 1, 3, 5, 1, 3, 5, 3} (n=8, U=6). Left: the arriving key stream. Centre-left: each cycle performs exactly one H[k]++ — no order comparison is ever evaluated. Centre-right: the histogram H[·] after all 8 cycles; this is the canonical ordered representation — no prefix sum, no reconstruction required. Right: the optional output vector, obtained by scanni… view at source ↗

**Figure 3.** Figure 3: DialSort across three physical substrates, shown for histogram state H={k=1:1, k=3:2, k=5:1}. Top (Substrate 1): water jet; dark gates admit flow ∝ H[k]. Middle (Substrate 2): clock sweeps U FPGA registers; nonzero registers conduct. Bottom (Substrate 3): light pulse; active resonators couple light out. Substrate 2 is an analytical projection; Substrate 3 is a future research direction. 10 [PITH_FULL_IMAG… view at source ↗

**Figure 4.** Figure 4: Abacus-and-torch analogy for DialSort. After ingestion, each value k occupies H[k] independent dials, each holding one bead at the height corresponding to value k: H[0]=1, H[1]=2, H[2]=0, H[3]=3, H[4]=1, H[5]=2. A torch sweeps the value axis from k=0 to U−1; repeated values appear side-by-side on their own dials at the same height. The shadow projected on the wall is the sorted output ⟨0, 1, 1, 3, 3, 3, 4,… view at source ↗

**Figure 5.** Figure 5: shows the average D/C ratio (DialSort ms ÷ Classic CS ms) as a function of N. The ratio decreases as N grows, confirming that DialSort’s structural savings scale favorably with input size. 104 105 106 107 0 0.5 1 parity 0.73 0.66 0.49 0.52 Input size N Avg. ratio D/C (lower = DialSort faster) [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Speedup over std::sort at N=107 : DialSort (dark) vs. Classic Counting Sort (light), grouped by distribution and universe size. DialSort achieves higher speedup in all 12 configurations. The largest gap appears at large U (65536) under skewed distributions [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: decomposes the average D/C ratio by universe size U. U=256 U=1024 U=65536 0 0.5 1 parity 0.55 0.55 0.74 Universe size U Avg. ratio D/C [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Head-to-head throughput comparison: DialSort (dark) vs. Classic Counting Sort (light) across all 48 benchmark configurations. DialSort wins in 46 of 48 cases. 10.2 Experimental setup The benchmark uses IPS4o (sequential, headeronly, cloned from the official repository) compiled under identical conditions (g++ 11.4.0 -O3 -std=c++17 -pthread) with TBB parallel support disabled to isolate sequential algori… view at source ↗

**Figure 9.** Figure 9: Throughput at N=107 : DialSort (dark), IPS4 o-seq (medium), and std::sort (light), across all distribution and universe-size combinations. On uniform and skewed inputs DialSort dominates at every U. On sorted/reverse inputs IPS4 o exploits pre-existing order via branch prediction, reaching >1500 M keys/s — outside DialSort’s target domain [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: DialSort performance envelope vs. IPS4o by (N, U) cell. Left: Sequential (DialSort vs. IPS4 o-seq, single thread). Right: Parallel (DialSort-Parallel vs. IPS4 o-par, 8 threads each). Each cell shows the number of distributions (out of 4: uniform, skewed, sorted, reverse-sorted) won by DialSort. Darker shading = fewer wins; lighter parchment = more wins. The sequential envelope shows a consistent 2/4 patte… view at source ↗

**Figure 11.** Figure 11: Throughput at N=107 : DialSort (dark), ska sort (medium), and std::sort (light), across all distribution and universe-size combinations. On uniform and skewed inputs DialSort dominates at every U. Unlike the IPS4 o comparison, DialSort also outperforms ska sort on sorted and reverse-sorted inputs — ska sort does not exploit pre-existing order. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

read the original abstract

Sorting over bounded-universe integer keys has traditionally relied on counting sort and radix sort, both of which incur mandatory prefix-sum passes, auxiliary scatter buffers, or multiple permutation passes. This paper introduces DialSort, a non-comparative sorting architecture based on the self-indexing principle: each integer key simultaneously encodes its value and its canonical position in the ordered address space [0,U-1]. DialSort eliminates the prefix-sum pass entirely by treating the histogram H as the canonical ordered representation, not as an intermediate structure. To support parallel ingestion without serialization, we introduce the Conflict Resolution Network (CRN), a pipelined additive reduction tree that resolves concurrent writes using equality checks exclusively, with no magnitude comparisons. Formal proofs establish O(n+U) sequential and O(n/k + log k + U) parallel time bounds. A software prototype on an 8-thread Intel x86-64 achieves 39.77x speedup over std::sort and peak throughput of 115.9 M keys/s. Against Classic Counting Sort, DialSort wins 46 of 48 configurations. Against IPS4o, DialSort outperforms it in 24 of 48 sequential and 29 of 48 parallel configurations. Against ska_sort, it wins 46 of 48 configurations. All 208 benchmark configurations passed correctness verification. DialSort is not a universal replacement for comparison-based sorting, but a domain-specialized architecture for bounded-universe workloads where sorting reduces to a geometric read over memory. Benchmark source and five open interactive simulators are released alongside this paper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DialSort claims to skip the prefix sum in bounded-integer sorting by self-indexing the histogram as the final output and using an equality-only CRN for parallel writes, but everything rests on knowing U in advance.

read the letter

Look, the punchline on this DialSort paper is that it presents a sorting method for integers in a known bounded range that avoids the usual prefix sum by making the histogram the final sorted structure through what they call self-indexing. What is actually new here is that self-indexing principle, where keys directly point to their positions in the ordered address space, combined with the Conflict Resolution Network. The CRN is a pipelined reduction tree that resolves concurrent writes using only equality checks, no magnitude comparisons. That setup allows parallel ingestion without the typical serialization issues. The paper does a good job on the implementation side. They have a software prototype that achieves significant speedups, 39.77 times over std::sort, with peak throughput of 115.9 million keys per second on an 8-thread machine. It outperforms classic counting sort in 46 of 48 configurations and does well against IPS4o and ska_sort in many cases too. All 208 benchmark configurations passed correctness checks, and they provide the source code along with interactive simulators, which adds credibility to the empirical claims. Where it gets soft is in the foundational assumptions. The method requires knowing the universe size U in advance and that all input keys are non-negative integers within [0, U-1]. Without that, you'd likely need extra passes to determine the range or handle resizing, which could bring back costs similar to the prefix sum they're trying to eliminate. The stated time bounds of O(n+U) sequential and O(n/k + log k + U) parallel seem derived from the architecture, but the parallel one includes the log k for the CRN depth. The abstract mentions formal proofs, but since we're looking at high-level descriptions, the strength depends on how those derivations handle edge cases like collisions or varying k. This work is for researchers and engineers dealing with high-volume bounded-universe integer sorting in areas like databases, data structures, or parallel computing. A reader who cares about specialized non-comparative algorithms and practical throughput improvements would find value in the architecture and the benchmark results. It has enough novelty in the approach and supporting evidence from benchmarks to merit a serious referee. I would recommend putting it through peer review so the proofs can be examined closely and any limitations around the bounded U assumption can be clarified.

Referee Report

0 major / 4 minor

Summary. The paper introduces DialSort, a non-comparative sorting architecture for bounded-universe non-negative integers based on the self-indexing principle. Keys map directly to positions in a histogram H, which is used as the canonical ordered representation, eliminating the prefix-sum pass. The Conflict Resolution Network (CRN) enables parallel processing by resolving concurrent writes with equality checks. Formal proofs claim O(n+U) sequential and O(n/k + log k + U) parallel time bounds. A prototype achieves 39.77x speedup over std::sort, wins against Classic Counting Sort in 46/48 configs, and all 208 benchmarks pass correctness verification.

Significance. This work provides a specialized sorting method that could offer performance advantages in domains with known small universe sizes, such as certain data processing tasks. The empirical speedups and open-source release of benchmarks and simulators are notable strengths, supporting reproducibility. The approach innovates on traditional counting sort by avoiding prefix sums through direct use of the histogram for ordered output generation.

minor comments (4)

Clarify in the architecture section how the sorted output is explicitly constructed from the histogram H to confirm the elimination of any equivalent linear pass.
Provide more details on the CRN implementation, including how equality checks are used in the reduction tree, perhaps with pseudocode or a figure reference.
In the benchmark section, specify the exact values of U, n, and thread counts (k) used in the 48 configurations for each comparison to aid reproducibility.
The abstract mentions 'Substrate-Aware Analysis'; expand on this in the main text, explaining what substrate means in this context (e.g., hardware or memory model).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work and for recommending minor revision. The referee correctly identifies the core innovations of DialSort, including the elimination of the prefix-sum pass via self-indexing, the role of the Conflict Resolution Network, the stated time bounds, and the empirical results across the benchmark suite. We appreciate the acknowledgment of the open-source artifacts and reproducibility support.

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper derives its O(n+U) sequential and O(n/k + log k + U) parallel time bounds directly from the described architecture: a single pass to build histogram H in O(n) time followed by an ordered traversal over [0, U-1] that emits each key according to its count in O(U + n) time. The CRN is introduced as a pipelined reduction tree using only equality checks for conflict resolution, yielding the parallel bound by construction from the tree depth and per-key work. These follow from the explicit steps and the self-indexing principle (keys as direct indices into [0, U-1]) under the stated precondition of known bounded U. No equations or claims reduce a result to a fitted parameter or prior self-citation; benchmarks are reported as separate empirical measurements. The bounded-universe assumption is a domain precondition, not a circularly derived claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on domain assumptions about the input distribution and on the introduction of one new algorithmic component; no numeric parameters are fitted to data.

axioms (2)

domain assumption Input keys are non-negative integers drawn from a known bounded universe [0, U-1].
This premise enables direct mapping of each key to a canonical address and allocation of a fixed-size histogram.
domain assumption U is small enough that an array of size U fits in memory.
Required for the histogram to serve as the ordered representation without additional compression or paging.

invented entities (1)

Conflict Resolution Network (CRN) no independent evidence
purpose: Pipelined additive reduction tree that resolves concurrent writes to the same address using equality checks only.
New component introduced to support parallel ingestion without serialization or magnitude comparisons.

pith-pipeline@v0.9.0 · 5819 in / 1646 out tokens · 69795 ms · 2026-05-19T20:29:06.706382+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

each integer key k simultaneously encodes its value and its canonical position in the ordered address space [0,U−1]. DialSort eliminates the prefix-sum pass entirely by treating the histogram H[·] as the canonical ordered representation
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat.induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Conflict Resolution Network (CRN), a pipelined additive reduction tree that resolves concurrent writes using equality checks exclusively, with no magnitude comparisons

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

D. E. Knuth,The Art of Computer Program- ming, Vol. 3: Sorting and Searching, 2nd ed. Addison-Wesley, 1998

work page 1998
[2]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to Algorithms, 3rd ed. MIT Press, 2009

work page 2009
[3]

In-place parallel super scalar sam- plesort (IPS 4o),

M. Axtmann, S. Witt, D. Ferizovic, and P. Sanders, “In-place parallel super scalar sam- plesort (IPS 4o),”ACM Trans. Parallel Com- put., vol. 9, no. 1, pp. 1–62, 2022

work page 2022
[4]

Super scalar sam- ple sort,

P. Sanders and S. Winkel, “Super scalar sam- ple sort,” inProc. ESA, 2004, pp. 784–796

work page 2004
[5]

Information sorting in the application of electronic digital computers to business operations,

H. H. Seward, “Information sorting in the application of electronic digital computers to business operations,” MIT, Tech. Rep., 1954

work page 1954
[6]

Sorting networks and their ap- plications,

K. E. Batcher, “Sorting networks and their ap- plications,” inAFIPS Spring Joint Comput. Conf., 1968, pp. 307–314

work page 1968
[7]

On the computational power of bead sort,

N. Auger, C. Nicaud, and C. Pivoteau, “On the computational power of bead sort,” RAIRO – Theor. Inform. Appl., vol. 46, no. 3, pp. 381–393, 2012

work page 2012
[8]

Prefix sums and their appli- cations,

G. E. Blelloch, “Prefix sums and their appli- cations,” CMU School of Computer Science Tech. Rep. CMU-CS-90-190, 1990

work page 1990
[9]

Sedgewick and K

R. Sedgewick and K. Wayne,Algorithms, 4th ed. Addison-Wesley, 2011,§6.8. 24

work page 2011
[10]

A new sort algorithm: self- indexed sort,

S. Y. Wang, “A new sort algorithm: self- indexed sort,”ACM SIGPLAN Notices, vol. 31, no. 3, pp. 28–36, 1996

work page 1996
[11]

Integer sorting inO(n √log logn) expected time and linear space,

Y. Han and M. Thorup, “Integer sorting inO(n √log logn) expected time and linear space,” inProc. IEEE FOCS, 2002, pp. 135– 144

work page 2002
[12]

Alveo U50 data center accelerator card data sheet,

Xilinx Inc., “Alveo U50 data center accelerator card data sheet,” DS965, 2021

work page 2021
[13]

Wafer-Scale Engine 3 (WSE-3) Data Sheet,

Cerebras Systems, “Wafer-Scale Engine 3 (WSE-3) Data Sheet,” DS03 v3, 2024

work page 2024
[14]

Supercharge your HPC research with the Cerebras SDK,

Cerebras Systems, “Supercharge your HPC research with the Cerebras SDK,”Tech- nical Blog, March 2025.https://www. cerebras.ai/blog/supercharge-your- hpc-research-with-the-cerebras-sdk 25

work page 2025

[1] [1]

D. E. Knuth,The Art of Computer Program- ming, Vol. 3: Sorting and Searching, 2nd ed. Addison-Wesley, 1998

work page 1998

[2] [2]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to Algorithms, 3rd ed. MIT Press, 2009

work page 2009

[3] [3]

In-place parallel super scalar sam- plesort (IPS 4o),

M. Axtmann, S. Witt, D. Ferizovic, and P. Sanders, “In-place parallel super scalar sam- plesort (IPS 4o),”ACM Trans. Parallel Com- put., vol. 9, no. 1, pp. 1–62, 2022

work page 2022

[4] [4]

Super scalar sam- ple sort,

P. Sanders and S. Winkel, “Super scalar sam- ple sort,” inProc. ESA, 2004, pp. 784–796

work page 2004

[5] [5]

Information sorting in the application of electronic digital computers to business operations,

H. H. Seward, “Information sorting in the application of electronic digital computers to business operations,” MIT, Tech. Rep., 1954

work page 1954

[6] [6]

Sorting networks and their ap- plications,

K. E. Batcher, “Sorting networks and their ap- plications,” inAFIPS Spring Joint Comput. Conf., 1968, pp. 307–314

work page 1968

[7] [7]

On the computational power of bead sort,

N. Auger, C. Nicaud, and C. Pivoteau, “On the computational power of bead sort,” RAIRO – Theor. Inform. Appl., vol. 46, no. 3, pp. 381–393, 2012

work page 2012

[8] [8]

Prefix sums and their appli- cations,

G. E. Blelloch, “Prefix sums and their appli- cations,” CMU School of Computer Science Tech. Rep. CMU-CS-90-190, 1990

work page 1990

[9] [9]

Sedgewick and K

R. Sedgewick and K. Wayne,Algorithms, 4th ed. Addison-Wesley, 2011,§6.8. 24

work page 2011

[10] [10]

A new sort algorithm: self- indexed sort,

S. Y. Wang, “A new sort algorithm: self- indexed sort,”ACM SIGPLAN Notices, vol. 31, no. 3, pp. 28–36, 1996

work page 1996

[11] [11]

Integer sorting inO(n √log logn) expected time and linear space,

Y. Han and M. Thorup, “Integer sorting inO(n √log logn) expected time and linear space,” inProc. IEEE FOCS, 2002, pp. 135– 144

work page 2002

[12] [12]

Alveo U50 data center accelerator card data sheet,

Xilinx Inc., “Alveo U50 data center accelerator card data sheet,” DS965, 2021

work page 2021

[13] [13]

Wafer-Scale Engine 3 (WSE-3) Data Sheet,

Cerebras Systems, “Wafer-Scale Engine 3 (WSE-3) Data Sheet,” DS03 v3, 2024

work page 2024

[14] [14]

Supercharge your HPC research with the Cerebras SDK,

Cerebras Systems, “Supercharge your HPC research with the Cerebras SDK,”Tech- nical Blog, March 2025.https://www. cerebras.ai/blog/supercharge-your- hpc-research-with-the-cerebras-sdk 25

work page 2025