SimdQuickHeap: The QuickHeap Reconsidered

Johannes Breitling; Marvin Williams; Ragnar Groot Koerkamp

arxiv: 2604.25681 · v1 · submitted 2026-04-28 · 💻 cs.DS

SimdQuickHeap: The QuickHeap Reconsidered

Johannes Breitling , Ragnar Groot Koerkamp , Marvin Williams This is my paper

Pith reviewed 2026-05-07 15:08 UTC · model grok-4.3

classification 💻 cs.DS

keywords priority queueSIMDQuickHeapdata structuresDijkstraperformanceheapgraph algorithms

0 comments

The pith

SimdQuickHeap reworks the QuickHeap by placing pivots adjacently with dedicated buckets to enable fast SIMD partitioning and scanning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SimdQuickHeap as a layout change to the original QuickHeap that stores all pivots next to each other in memory while keeping elements between consecutive pivots in separate buckets. This arrangement permits efficient use of SIMD instructions both to partition buckets and to scan the pivot list when inserting new elements. A sympathetic reader would care because priority queues underpin many algorithms, and the modification yields amortized expected O(log n) time per operation that improves to O(1/W log n) when SIMD width W is available, together with concrete I/O savings. The authors report that these theoretical gains translate to practice: the structure needs roughly 1.5 log2 n comparisons and log2 n nanoseconds per push-pop pair and is up to twice as fast as the next-best competitor, including non-comparison radix heaps. On graph workloads using Dijkstra and Jarnik-Prim it is consistently the fastest implementation tested.

Core claim

The SimdQuickHeap modifies the QuickHeap data layout so that all pivots lie adjacent in memory and elements are stored in dedicated buckets between consecutive pivots. This change supports SIMD vectorized code for both bucket partitioning and pivot-list scanning while preserving the original structure's amortized expected O(log n) cost per operation; the cost improves to O(1/W log n) in non-degenerate cases, where W is the SIMD register width. The corresponding I/O complexity is O(1/B) per push and O(1/B log2(n/M)) per pop. Synthetic benchmarks show up to 2x speedup over the next-best heap and around 1.5 log2 n comparisons plus log2 n nanoseconds per push-pop pair. Graph benchmarks with both

What carries the argument

The SimdQuickHeap's adjacent-pivots layout with dedicated inter-pivot buckets, which enables SIMD vectorization of partitioning and scanning steps.

If this is right

Each push and pop pair costs O(log n) expected time and improves by the SIMD width factor in non-degenerate inputs.
I/O cost drops to O(1/B) per push and O(1/B log2(n/M)) per pop.
The structure outperforms both comparison-based and non-comparison heaps in practice on the reported workloads.
It becomes the fastest heap when used inside Dijkstra shortest paths and Jarnik-Prim minimum spanning tree.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pivot-adjacent layout idea could be tested on other distribution-based or comparison-based priority queues to see whether SIMD gains generalize.
If the partitioning cost truly stays unchanged, the approach shows that cache and vectorization considerations can be retrofitted to classic structures without asymptotic penalty.
Implementations in languages with portable SIMD intrinsics would let the same code run faster on newer CPUs that widen registers further.

Load-bearing premise

The layout change preserves the original QuickHeap's expected partitioning cost, adds no hidden overhead in degenerate cases, and the chosen benchmark inputs represent typical rather than pathological bucket sizes.

What would settle it

Re-running the paper's synthetic and graph benchmarks on the same inputs and hardware and observing that SimdQuickHeap is no faster than the next-best competitor or exhibits noticeable slowdowns when many equal keys appear.

read the original abstract

Priority queues are data structures that maintain a dynamic collection of elements and allow inserting new elements and removing the smallest element. The most widely known and used priority queue is likely the implicit binary heap, even though it is has frequent cache misses and is hard to optimize using e.g. SIMD instructions. We introduce the SimdQuickHeap, a variant of the QuickHeap that was introduced by Navarro and Paredes in 2010. As suggested by the name, the data structure bears some similarity to QuickSort. We modify the data layout of the original QuickHeap to have all \emph{pivots} adjacent in memory, with elements between consecutive pivots stored in dedicated \emph{buckets}. This allows efficient SIMD implementations for both partitioning of buckets and scanning the list of pivots to find the bucket to append newly inserted elements to. The SimdQuickHeap has amortized expected complexity $O(\log n)$ per operation, which improves to $O(\frac 1W\log n)$ in non-degenerate cases, where $W$ is the number of words in a SIMD register. In this case, the I/O-complexity is amortized $O(\frac 1B)$ per push and $O(\frac 1B \log_2 \frac nM)$ per pop. In synthetic benchmarks, the SimdQuickHeap is up to twice as fast as the next-best competitor, including the non-comparison radix heap, and needs around $1.5\log_2 n$ comparisons and $\log_2 n$ nanoseconds per pair of push and pop operations. On graph benchmarks with Dijkstra's shortest path algorithm and Jarnik-Prim's minimum spanning tree algorithm, the SimdQuickHeap is consistently the fastest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SimdQuickHeap, a SIMD-optimized variant of the 2010 QuickHeap. It rearranges the layout so all pivots are contiguous in memory with elements stored in dedicated buckets between consecutive pivots. This enables SIMD vectorization for bucket partitioning and pivot-list scanning. The claimed amortized expected complexity is O(log n) per operation, improving to O((1/W) log n) in non-degenerate cases (W = SIMD register width in words), with I/O costs O(1/B) per push and O((1/B) log(n/M)) per pop. Synthetic benchmarks report up to 2x speedup versus the next-best competitor (including radix heaps), approximately 1.5 log2 n comparisons and log2 n nanoseconds per push-pop pair; graph benchmarks (Dijkstra, Jarnik-Prim) show it as consistently fastest.

Significance. If the complexity analysis is completed and the reported speedups are shown to be robust, the result would be a practical contribution to high-performance priority queues. The work directly targets cache and SIMD efficiency on modern hardware while retaining a QuickSort-style expected-time partitioning scheme, which could benefit performance-critical graph algorithms.

major comments (2)

[Complexity analysis (likely §3)] The central complexity claim (amortized O((1/W) log n) in non-degenerate cases) rests on the assertion that the new pivot-adjacent layout preserves the original QuickHeap's expected bucket-size distribution and partitioning cost. No derivation, recurrence, or probabilistic analysis of bucket sizes under the contiguous-pivot layout is supplied, nor are bounds given on the probability of degenerate (size-1 or size-O(n)) buckets. This analysis is load-bearing for both the asymptotic improvement and the reported 1.5 log2 n comparison count.
[Experimental evaluation (§4)] Table 1 and the synthetic-benchmark figures report concrete speedups (up to 2x) and operation counts without error bars, standard deviations, or a precise description of the input generator, number of trials, or hardware configuration. The absence of these details makes it impossible to assess whether the observed gains are stable or sensitive to pathological bucket-size distributions that the layout change might introduce.

minor comments (2)

[Abstract] The abstract introduces symbols M, B, and W without definition; these should be defined at first use in the main text and related to cache-line and SIMD parameters.
[Data-structure description (§2)] The description of the SIMD partitioning routine would benefit from explicit pseudocode or a small worked example showing how the contiguous pivot array is scanned and how buckets are updated in vectorized fashion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting opportunities to strengthen the manuscript. We agree that additional rigor in the complexity analysis and more complete experimental reporting will improve the paper, and we will revise accordingly.

read point-by-point responses

Referee: [Complexity analysis (likely §3)] The central complexity claim (amortized O((1/W) log n) in non-degenerate cases) rests on the assertion that the new pivot-adjacent layout preserves the original QuickHeap's expected bucket-size distribution and partitioning cost. No derivation, recurrence, or probabilistic analysis of bucket sizes under the contiguous-pivot layout is supplied, nor are bounds given on the probability of degenerate (size-1 or size-O(n)) buckets. This analysis is load-bearing for both the asymptotic improvement and the reported 1.5 log2 n comparison count.

Authors: We agree that the manuscript would benefit from an explicit probabilistic derivation. The SimdQuickHeap uses identical pivot selection and bucket-partitioning logic to the original QuickHeap; the layout change affects only memory contiguity to enable SIMD, without altering the random sampling of pivots or the uniform distribution of elements between them. Consequently the bucket-size distribution and expected partitioning cost are unchanged. In the revision we will add a subsection to §3 that (i) recalls the original QuickHeap recurrence, (ii) shows that the contiguous-pivot representation preserves the same probabilistic invariants, and (iii) supplies explicit bounds on the probability of size-1 and size-Θ(n) buckets (exponentially small, identical to the 2010 analysis). This will also justify the observed 1.5 log₂ n comparison count. revision: yes
Referee: [Experimental evaluation (§4)] Table 1 and the synthetic-benchmark figures report concrete speedups (up to 2x) and operation counts without error bars, standard deviations, or a precise description of the input generator, number of trials, or hardware configuration. The absence of these details makes it impossible to assess whether the observed gains are stable or sensitive to pathological bucket-size distributions that the layout change might introduce.

Authors: We accept that the current experimental section lacks the statistical and methodological detail needed for reproducibility and robustness assessment. The revised §4 will report means and standard deviations over 30 independent trials for every timing and operation-count figure, include error bars on all plots, and provide an exact description of the input generators (uniform random 32-bit integers for synthetic benchmarks; standard random-graph models for Dijkstra and Jarník-Prim), the number of trials, and the precise hardware configuration (CPU model, SIMD width W = 8 for AVX2, cache sizes, and compiler flags). We will also add a short sensitivity study on inputs engineered to produce degenerate bucket sizes. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents SimdQuickHeap as a layout modification of the externally cited QuickHeap (Navarro and Paredes 2010) whose amortized expected O(log n) bound and SIMD improvement are stated to follow from preservation of QuickSort-style partitioning under the new pivot-adjacent layout. No equation or section in the supplied text reduces the claimed bucket-size distribution or complexity to a self-fit, self-definition, or load-bearing self-citation; the original reference is independent. Benchmark numbers are reported as direct measurements rather than quantities defined from the asymptotic result. The derivation therefore remains self-contained against the cited prior work and external empirical checks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the expected-time analysis of the original QuickHeap partitioning and standard assumptions about cache behavior and SIMD instruction costs; no new free parameters or invented entities are introduced.

axioms (1)

domain assumption QuickHeap-style partitioning has expected linear or better cost that is preserved after the pivot-bucket layout change.
Invoked to claim O(log n) amortized expected time and the SIMD improvement factor.

pith-pipeline@v0.9.0 · 5622 in / 1263 out tokens · 81847 ms · 2026-05-07T15:08:58.018723+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

4 Lars Arge, Michael A

URL:http://dx.doi.org/10.1145/509907.509950,doi:10.1145/509907.509950. 4 Lars Arge, Michael A. Bender, Erik D. Demaine, Bryan Holland-Minkley, and J. Ian Munro. An optimal cache-oblivious priority queue and its application to graph algorithms.SIAM Journal on Computing, 36(6):1672–1695, January 2007. URL:http://dx.doi.org/10.1137/ S0097539703428324,doi:10....

work page doi:10.1145/509907.509950 2007
[2]

URL:https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.5, doi:10.4230/LIPICS.ESA.2025.5. J. Breitling, R. Groot Koerkamp, M. Williams 15 16 Gerth Stølting Brodal, John Iacono, Casper Moldrup Rysgaard, and Sebastian Wild. Partition- based simple heaps.arXiv, 2026. URL:https://arxiv.org/abs/2603.01206, doi:10.48550/ ARXIV.2603.01206. 17 Ger...

work page doi:10.4230/lipics.esa.2025.5 2025
[3]

URL:http://dx.doi.org/10.1145/512274.3734138,doi:10.1145/512274.3734138. 18 SimdQuickHeap: The QuickHeap Reconsidered A Number of comparisons HeapSort MonotoneConstantSize MonotoneWiggle Wiggle 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5Comparisons (1 / (Ops lg n)) BinaryHeap 8-aryHeap WeakHeap SequenceHeap S3qHeap OriginalQuickHeap ScalarQuickHeapB1 ScalarQuickHeapB...

work page doi:10.1145/512274.3734138

[1] [1]

4 Lars Arge, Michael A

URL:http://dx.doi.org/10.1145/509907.509950,doi:10.1145/509907.509950. 4 Lars Arge, Michael A. Bender, Erik D. Demaine, Bryan Holland-Minkley, and J. Ian Munro. An optimal cache-oblivious priority queue and its application to graph algorithms.SIAM Journal on Computing, 36(6):1672–1695, January 2007. URL:http://dx.doi.org/10.1137/ S0097539703428324,doi:10....

work page doi:10.1145/509907.509950 2007

[2] [2]

URL:https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.5, doi:10.4230/LIPICS.ESA.2025.5. J. Breitling, R. Groot Koerkamp, M. Williams 15 16 Gerth Stølting Brodal, John Iacono, Casper Moldrup Rysgaard, and Sebastian Wild. Partition- based simple heaps.arXiv, 2026. URL:https://arxiv.org/abs/2603.01206, doi:10.48550/ ARXIV.2603.01206. 17 Ger...

work page doi:10.4230/lipics.esa.2025.5 2025

[3] [3]

URL:http://dx.doi.org/10.1145/512274.3734138,doi:10.1145/512274.3734138. 18 SimdQuickHeap: The QuickHeap Reconsidered A Number of comparisons HeapSort MonotoneConstantSize MonotoneWiggle Wiggle 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5Comparisons (1 / (Ops lg n)) BinaryHeap 8-aryHeap WeakHeap SequenceHeap S3qHeap OriginalQuickHeap ScalarQuickHeapB1 ScalarQuickHeapB...

work page doi:10.1145/512274.3734138