Scalable Concurrent Queues for GPU

Pratheek Prakash Shetty; Thomas R. W. Scogland; Wu-chun Feng

arxiv: 2606.01693 · v1 · pith:I36KLBKMnew · submitted 2026-06-01 · 💻 cs.DC · cs.DS

Scalable Concurrent Queues for GPU

Pratheek Prakash Shetty , Thomas R. W. Scogland , Wu-chun Feng This is my paper

Pith reviewed 2026-06-28 12:59 UTC · model grok-4.3

classification 💻 cs.DC cs.DS

keywords concurrent queuesGPUlock-freewait-freelinearizabilitySIMTatomic operationsHPC

0 comments

The pith

Three new GPU queue designs achieve lock-free to wait-free linearizable concurrency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces three concurrent queue implementations for GPUs that maintain linearizability while providing different progress guarantees. G-WFQ-YMC adapts an existing wait-free queue using preallocated segments. G-LFQ uses wave-batched fast paths for bounded lock-free operation, and G-WFQ packs state into 64-bit compare-and-swap for bounded wait-free behavior. These address the challenge of efficient item movement between threads under GPU SIMT execution and high atomic contention. A reader would care because queues are critical bottlenecks for task distribution and load balancing as HPC systems exceed ten million cores.

Core claim

We present three linearizable GPU concurrent queues spanning from lock-free to wait-free guarantees: (1) G-WFQ-YMC, an adaptation of Yang and Mellor-Crummey's wait-free queue using preallocated segments; (2) G-LFQ, a bounded lock-free queue that uses wave-batched fast paths to maximize throughput; and (3) G-WFQ, a bounded wait-free queue that packs shared state into 64-bit compare-and-swap operations while preserving linearizability and bounded memory.

What carries the argument

The three queue variants G-WFQ-YMC, G-LFQ, and G-WFQ that adapt CPU designs to the GPU SIMT model via preallocated segments, wave-batched fast paths, and packed 64-bit CAS operations.

Load-bearing premise

The GPU SIMT execution model and atomic contention patterns allow the described adaptations to deliver the claimed lock-free and wait-free progress without breaking linearizability or bounded memory.

What would settle it

Executing the three queue implementations on actual GPU hardware under high contention and detecting either a linearizability violation or a failure to meet the stated progress guarantee would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.01693 by Pratheek Prakash Shetty, Thomas R. W. Scogland, Wu-chun Feng.

**Figure 1.** Figure 1: Wave-batching reduces atomic contention c) Wave-batched ticket reservation: A naive FAA-based queue performs one fetch-and-add per active thread on Head or Tail. On a GPU, that needlessly concentrates contention on the same counter word. G-LFQ instead batches reservations within a wavefront. Active lanes first form a mask with a ballot instruction. One leader performs a single fetch-and-add by the number o… view at source ↗

**Figure 2.** Figure 2: Entry word for G-WFQ equivalent to the sequential ticket order assumed by sCQ. By Lemma III.2, the reduced-width cycle field preserves the same live-slot ordering as the unbounded counter view on all reachable states. Since each slot update is performed by a single 64-bit atomic operation, the queue exposes the same slot semantics to concurrent threads as the sCQ ring. Therefore, GLFQ is linearizable and … view at source ↗

**Figure 3.** Figure 3: Global and local Tail/Head word for G-WFQ [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Fixed-duration successful-operation throughput (Mops/s) across four queue micro-benchmarks on AMD MI210 and MI300A: balanced [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Per-operation profiling metrics across four queue micro-benchmarks on AMD MI300A and MI210. Each row shows WAIT/op and [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: BFS runtime relative to Gunrock across 9 graphs. G-LFQ consistently beats Gunrock, while G-WFQ is faster on several graphs. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Ray-tracing throughput relative to stream-compaction across both scenes and GPUs. On MI210, G-LFQ outperforms stream-compaction [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

read the original abstract

Concurrent queues can significantly impact supercomputing performance by being critical bottlenecks for task distribution, load balancing, and resource utilization. As HPC systems move beyond 10-million processor cores, the ability to rapidly move items between producer and consumer threads without excessive locking is essential for efficient queues, preventing idle cores, maximizing utilization, and achieving high parallel speedup. While concurrent queues are well studied on CPUs, they remain largely unexplored on modern GPUs, where SIMT execution, massive parallelism, and atomic contention reshape the design space. We present three linearizable GPU concurrent queues spanning from lock-free to wait-free guarantees: (1) G-WFQ-YMC, an adaptation of Yang and Mellor-Crummey's wait-free queue using preallocated segments; (2) G-LFQ, a bounded lock-free queue that uses wave-batched fast paths to maximize throughput; and (3) G-WFQ, a bounded wait-free queue that packs shared state into 64-bit compare-and-swap operations while preserving linearizability and bounded memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Three new GPU queue designs adapt known techniques to SIMT constraints with clear claims on linearizability and progress, but the value hinges on whether the full paper backs them with measurements and arguments.

read the letter

The paper's core contribution is three concrete queue designs for GPUs: G-WFQ-YMC adapts a segment-based wait-free queue, G-LFQ adds wave batching for a bounded lock-free path, and G-WFQ packs state into 64-bit CAS for wait-free behavior while keeping memory bounded. These target the SIMT execution and high contention that make CPU queues perform poorly on GPUs.

It does a solid job stating the problem in HPC task distribution and load balancing at scale, and the designs follow a standard pattern of preallocation and atomic packing to fit GPU limits. The claims about linearizability and progress guarantees are explicit and tied directly to the mechanisms described.

The soft spot is evidence. The abstract alone gives no proofs, throughput numbers, or contention tests, so the full paper must show that the adaptations actually preserve the guarantees under real SIMT scheduling and atomic traffic. If the arguments rest only on informal reasoning about wave execution without a counter-example check or formal sketch, that weakens the central claim. Minor issues like missing edge cases in bounded memory would be easy to fix.

This is for people building GPU runtimes or concurrent structures in HPC. A reader already working on lock-free algorithms will see the specific tweaks and can judge the fit to their workloads.

It deserves peer review because the topic matters and the designs are grounded enough to test. Send it out and let referees check the implementation details and numbers.

Referee Report

2 major / 0 minor

Summary. The manuscript presents three linearizable concurrent queue designs for GPUs: (1) G-WFQ-YMC, an adaptation of Yang and Mellor-Crummey's wait-free queue using preallocated segments; (2) G-LFQ, a bounded lock-free queue employing wave-batched fast paths; and (3) G-WFQ, a bounded wait-free queue that packs shared state into 64-bit CAS operations. The central claims are that these designs achieve the stated progress guarantees (lock-free to wait-free) while preserving linearizability and bounded memory under the GPU SIMT execution model and atomic contention patterns.

Significance. If the designs can be shown to deliver the claimed properties, the work would address an important gap in GPU data structures for HPC task distribution and load balancing. The adaptation strategy (segment preallocation, wave batching, 64-bit CAS packing) follows standard patterns and could be valuable if accompanied by validation. No machine-checked proofs, reproducible code, or falsifiable predictions are present to strengthen the assessment.

major comments (2)

[Abstract] Abstract: The claims of linearizability plus lock-free/wait-free progress guarantees are stated without any proofs, invariants, or detailed argument showing why the specific adaptations (preallocated segments, wave batching, 64-bit CAS packing) preserve these properties under SIMT execution and atomic contention. This is load-bearing for the central contribution.
[Abstract] Abstract: No implementation details, performance measurements, or experimental results are supplied to support the throughput claims or to demonstrate that the designs achieve the stated guarantees in practice. This absence prevents assessment of whether the weakest assumption (SIMT + atomic contention preserving progress and linearizability) actually holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The manuscript presents algorithmic designs for GPU queues with informal arguments for their properties; we address the concerns below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: The claims of linearizability plus lock-free/wait-free progress guarantees are stated without any proofs, invariants, or detailed argument showing why the specific adaptations (preallocated segments, wave batching, 64-bit CAS packing) preserve these properties under SIMT execution and atomic contention. This is load-bearing for the central contribution.

Authors: We agree that explicit invariants and arguments are needed to substantiate the claims. The designs adapt established CPU techniques, with segment preallocation avoiding dynamic allocation under SIMT, wave batching mitigating divergence, and 64-bit CAS reducing contention points. In revision we will add a dedicated section with key invariants and proof sketches for linearizability and progress guarantees under the stated GPU model. revision: yes
Referee: [Abstract] Abstract: No implementation details, performance measurements, or experimental results are supplied to support the throughput claims or to demonstrate that the designs achieve the stated guarantees in practice. This absence prevents assessment of whether the weakest assumption (SIMT + atomic contention preserving progress and linearizability) actually holds.

Authors: The manuscript is a design paper and contains no experimental results or detailed implementation code. We will revise to include operation pseudocode and, space permitting, notes on a prototype implementation to illustrate how the adaptations function under SIMT and atomic contention. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents three new GPU queue designs (G-WFQ-YMC, G-LFQ, G-WFQ) via direct construction and adaptation of external prior work (Yang and Mellor-Crummey). No equations, fitted parameters, predictions, or self-referential reductions appear in the abstract or claim text. All progress and linearizability claims rest on explicit design choices (segment preallocation, wave batching, 64-bit CAS packing) whose correctness is conditioned on the stated SIMT/atomic assumption rather than any internal fit or self-citation chain. This is a standard engineering paper with no load-bearing derivation steps that reduce to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5709 in / 1050 out tokens · 25528 ms · 2026-06-28T12:59:15.216898+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 21 canonical work pages

[1]

and Scott, Michael L

M. M. Michael and M. L. Scott, “Simple, fast, and practical non- blocking and blocking concurrent queue algorithms,” inProceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, ser. PODC ’96. New York, NY , USA: Association for Computing Machinery, 1996, p. 267–275. [Online]. Available: https://doi.org/10.1145/248052.248106

work page doi:10.1145/248052.248106 1996
[2]

A simple, fast and scalable non-blocking concurrent fifo queue for shared memory multiprocessor systems,

P. Tsigas and Y . Zhang, “A simple, fast and scalable non-blocking concurrent fifo queue for shared memory multiprocessor systems,” in Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA ’01. New York, NY , USA: Association for Computing Machinery, 2001, p. 134–143. [Online]. Available: https://doi.org/10...

work page doi:10.1145/378580.378611 2001
[3]

A methodology for creating fast wait-free data structures,

A. Kogan and E. Petrank, “A methodology for creating fast wait-free data structures,”SIGPLAN Not., vol. 47, no. 8, p. 141–150, Feb. 2012. [Online]. Available: https://doi.org/10.1145/2370036.2145835

work page doi:10.1145/2370036.2145835 2012
[4]

Wait-free queues with multiple enqueuers and dequeuers,

——, “Wait-free queues with multiple enqueuers and dequeuers,” in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 223–234. [Online]. Available: https://doi.org/10.1145/1941553.1941585

work page doi:10.1145/1941553.1941585 2011
[5]

A wait-free queue as fast as fetch-and-add,

C. Yang and J. Mellor-Crummey, “A wait-free queue as fast as fetch-and-add,” inProceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’16. New York, NY , USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2851141.2851168

work page doi:10.1145/2851141.2851168 2016
[6]

SIGPLAN Not

A. Morrison and Y . Afek, “Fast concurrent queues for x86 processors,” SIGPLAN Not., vol. 48, no. 8, p. 103–112, Feb. 2013. [Online]. Available: https://doi.org/10.1145/2517327.2442527

work page doi:10.1145/2517327.2442527 2013
[7]

A Scalable, Portable, and Memory-Efficient Lock- Free FIFO Queue,

R. Nikolaev, “A Scalable, Portable, and Memory-Efficient Lock- Free FIFO Queue,” in33rd International Symposium on Distributed Computing (DISC 2019), ser. Leibniz International Proceedings in Informatics (LIPIcs), J. Suomela, Ed., vol. 146. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum f ¨ur Informatik, 2019, pp. 28:1– 28:16. [Online]. Available: ...

work page doi:10.4230/lipics.disc.2019.28 2019
[8]

wcq: a fast wait-free queue with bounded memory usage,

R. Nikolaev and B. Ravindran, “wcq: a fast wait-free queue with bounded memory usage,” inProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 461–462. [Online]. Available: https://doi.org/10. 1145/3503221.3508440

arXiv 2022
[9]

Aggregating funnels for faster fetch&add and queues,

Y . Roh, Y . Wei, E. Ruppert, P. Fatourou, S. Jayanti, and J. Shun, “Aggregating funnels for faster fetch&add and queues,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 99–114. [Online]. Available: https://doi.org/...

work page doi:10.1145/3710848.3710873 2025
[10]

Splitwise: Efficient generative LLM inference using phase splitting,

A. Smith, G. H. Loh, M. J. Schulte, M. Ignatowski, S. Naffziger, M. Mantor, M. Fowler, N. Kalyanasundharam, V . Alla, N. Malaya, J. L. Greathouse, E. Chapman, and R. Swaminathan, “Realizing the amd exascale heterogeneous processor vision,” inProceedings of the 51st Annual International Symposium on Computer Architecture, ser. ISCA ’24. IEEE Press, 2025, p...

work page doi:10.1109/isca59077.2024.00068 2025
[11]

Frontier: Exploring exascale

S. L. Atchley, C. Zimmer, J. Lange, D. Bernholdt, V . Melesse Vergara, T. Beck, M. Brim, T. Evans, R. Budiardja, S. Chandrasekaran et al., “Frontier: Exploring exascale.” Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States), 04 2024. [Online]. Available: https://www.osti.gov/biblio/2438964

arXiv 2024
[12]

Design and evaluation of scalable concurrent queues for many-core architectures,

T. R. Scogland and W.-c. Feng, “Design and evaluation of scalable concurrent queues for many-core architectures,” inProceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’15. New York, NY , USA: Association for Computing Machinery, 2015, p. 63–74. [Online]. Available: https://doi.org/10.1145/2668930.2688048

work page doi:10.1145/2668930.2688048 2015
[13]

The broker queue: A fast, linearizable fifo queue for fine-granular work distribution on the gpu,

B. Kerbl, M. Kenzel, J. H. Mueller, D. Schmalstieg, and M. Steinberger, “The broker queue: A fast, linearizable fifo queue for fine-granular work distribution on the gpu,” inProceedings of the 2018 International Conference on Supercomputing, ser. ICS ’18. New York, NY , USA: Association for Computing Machinery, 2018, p. 76–85. [Online]. Available: https:/...

work page doi:10.1145/3205289.3205291 2018
[14]

Boundary-aware concurrent queue: A fast and scalable concurrent fifo queue on gpu environments,

M. S. H. Polak, D. A. Troendle, and B. Jang, “Boundary-aware concurrent queue: A fast and scalable concurrent fifo queue on gpu environments,”Applied Sciences, vol. 15, no. 4, 2025. [Online]. Available: https://www.mdpi.com/2076-3417/15/4/1834

2025
[15]

and Wing, Jeannette M

M. P. Herlihy and J. M. Wing, “Linearizability: a correctness condition for concurrent objects,”ACM Trans. Program. Lang. Syst., vol. 12, no. 3, p. 463–492, Jul. 1990. [Online]. Available: https://doi.org/10.1145/78969.78972

work page doi:10.1145/78969.78972 1990
[16]

Faster linearizability checking via p-compositionality,

A. Horn and D. Kroening, “Faster linearizability checking via p-compositionality,” inFormal Techniques for Distributed Objects, Components, and Systems, S. Graf and M. Viswanathan, Eds. Cham: Springer International Publishing, 2015, pp. 50–65. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-19195-9 4

work page doi:10.1007/978-3-319-19195-9 2015
[17]

The state-of-the-art lcrq concurrent queue algorithm does not require cas2,

R. Romanov and N. Koval, “The state-of-the-art lcrq concurrent queue algorithm does not require cas2,” inProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 14–26. [Online]. Available: https://doi.org/10.1145/3572848.3577485

work page doi:10.1145/3572848.3577485 2023
[18]

Gunrock: Gpu graph analytics,

Y . Wang, Y . Pan, A. Davidson, Y . Wu, C. Yang, L. Wang, M. Osama, C. Yuan, W. Liu, A. T. Riffel, and J. D. Owens, “Gunrock: Gpu graph analytics,”ACM Trans. Parallel Comput., vol. 4, no. 1, Aug. 2017. [Online]. Available: https://doi.org/10.1145/3108140

work page doi:10.1145/3108140 2017
[19]

Optix: a general purpose ray tracing engine,

S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich, “Optix: a general purpose ray tracing engine,” ACM Trans. Graph., vol. 29, no. 4, Jul. 2010. [Online]. Available: https://doi.org/10.1145/1778765.1778803

work page doi:10.1145/1778765.1778803 2010
[20]

Megakernels considered harmful: wavefront path tracing on gpus,

S. Laine, T. Karras, and T. Aila, “Megakernels considered harmful: wavefront path tracing on gpus,” inProceedings of the 5th High- Performance Graphics Conference, ser. HPG ’13. New York, NY , USA: Association for Computing Machinery, 2013, p. 137–143. [Online]. Available: https://doi.org/10.1145/2492045.2492060

work page doi:10.1145/2492045.2492060 2013
[21]

Active thread compaction for gpu path tracing,

I. Wald, “Active thread compaction for gpu path tracing,” in Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, ser. HPG ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 51–58. [Online]. Available: https://doi.org/10.1145/2018323.2018331

work page doi:10.1145/2018323.2018331 2011
[22]

On ray reordering techniques for faster gpu ray tracing,

D. Meister, J. Boksansky, M. Guthe, and J. Bittner, “On ray reordering techniques for faster gpu ray tracing,” inSymposium on Interactive 3D Graphics and Games, ser. I3D ’20. New York, NY , USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3384382.3384534

work page doi:10.1145/3384382.3384534 2020
[23]

Vectorized production path tracing,

M. Lee, B. Green, F. Xie, and E. Tabellion, “Vectorized production path tracing,” inProceedings of High Performance Graphics, ser. HPG ’17. New York, NY , USA: Association for Computing Machinery,
[24]

Available: https://doi.org/10.1145/3105762.3105768

[Online]. Available: https://doi.org/10.1145/3105762.3105768

work page doi:10.1145/3105762.3105768
[25]

The University of Florida sparse matrix collection,

T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,”ACM Trans. Math. Softw., vol. 38, no. 1, Dec. 2011. [Online]. Available: https://doi.org/10.1145/2049662.2049663

work page doi:10.1145/2049662.2049663 2011

[1] [1]

and Scott, Michael L

M. M. Michael and M. L. Scott, “Simple, fast, and practical non- blocking and blocking concurrent queue algorithms,” inProceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, ser. PODC ’96. New York, NY , USA: Association for Computing Machinery, 1996, p. 267–275. [Online]. Available: https://doi.org/10.1145/248052.248106

work page doi:10.1145/248052.248106 1996

[2] [2]

A simple, fast and scalable non-blocking concurrent fifo queue for shared memory multiprocessor systems,

P. Tsigas and Y . Zhang, “A simple, fast and scalable non-blocking concurrent fifo queue for shared memory multiprocessor systems,” in Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA ’01. New York, NY , USA: Association for Computing Machinery, 2001, p. 134–143. [Online]. Available: https://doi.org/10...

work page doi:10.1145/378580.378611 2001

[3] [3]

A methodology for creating fast wait-free data structures,

A. Kogan and E. Petrank, “A methodology for creating fast wait-free data structures,”SIGPLAN Not., vol. 47, no. 8, p. 141–150, Feb. 2012. [Online]. Available: https://doi.org/10.1145/2370036.2145835

work page doi:10.1145/2370036.2145835 2012

[4] [4]

Wait-free queues with multiple enqueuers and dequeuers,

——, “Wait-free queues with multiple enqueuers and dequeuers,” in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 223–234. [Online]. Available: https://doi.org/10.1145/1941553.1941585

work page doi:10.1145/1941553.1941585 2011

[5] [5]

A wait-free queue as fast as fetch-and-add,

C. Yang and J. Mellor-Crummey, “A wait-free queue as fast as fetch-and-add,” inProceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’16. New York, NY , USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2851141.2851168

work page doi:10.1145/2851141.2851168 2016

[6] [6]

SIGPLAN Not

A. Morrison and Y . Afek, “Fast concurrent queues for x86 processors,” SIGPLAN Not., vol. 48, no. 8, p. 103–112, Feb. 2013. [Online]. Available: https://doi.org/10.1145/2517327.2442527

work page doi:10.1145/2517327.2442527 2013

[7] [7]

A Scalable, Portable, and Memory-Efficient Lock- Free FIFO Queue,

R. Nikolaev, “A Scalable, Portable, and Memory-Efficient Lock- Free FIFO Queue,” in33rd International Symposium on Distributed Computing (DISC 2019), ser. Leibniz International Proceedings in Informatics (LIPIcs), J. Suomela, Ed., vol. 146. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum f ¨ur Informatik, 2019, pp. 28:1– 28:16. [Online]. Available: ...

work page doi:10.4230/lipics.disc.2019.28 2019

[8] [8]

wcq: a fast wait-free queue with bounded memory usage,

R. Nikolaev and B. Ravindran, “wcq: a fast wait-free queue with bounded memory usage,” inProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 461–462. [Online]. Available: https://doi.org/10. 1145/3503221.3508440

arXiv 2022

[9] [9]

Aggregating funnels for faster fetch&add and queues,

Y . Roh, Y . Wei, E. Ruppert, P. Fatourou, S. Jayanti, and J. Shun, “Aggregating funnels for faster fetch&add and queues,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 99–114. [Online]. Available: https://doi.org/...

work page doi:10.1145/3710848.3710873 2025

[10] [10]

Splitwise: Efficient generative LLM inference using phase splitting,

A. Smith, G. H. Loh, M. J. Schulte, M. Ignatowski, S. Naffziger, M. Mantor, M. Fowler, N. Kalyanasundharam, V . Alla, N. Malaya, J. L. Greathouse, E. Chapman, and R. Swaminathan, “Realizing the amd exascale heterogeneous processor vision,” inProceedings of the 51st Annual International Symposium on Computer Architecture, ser. ISCA ’24. IEEE Press, 2025, p...

work page doi:10.1109/isca59077.2024.00068 2025

[11] [11]

Frontier: Exploring exascale

S. L. Atchley, C. Zimmer, J. Lange, D. Bernholdt, V . Melesse Vergara, T. Beck, M. Brim, T. Evans, R. Budiardja, S. Chandrasekaran et al., “Frontier: Exploring exascale.” Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States), 04 2024. [Online]. Available: https://www.osti.gov/biblio/2438964

arXiv 2024

[12] [12]

Design and evaluation of scalable concurrent queues for many-core architectures,

T. R. Scogland and W.-c. Feng, “Design and evaluation of scalable concurrent queues for many-core architectures,” inProceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’15. New York, NY , USA: Association for Computing Machinery, 2015, p. 63–74. [Online]. Available: https://doi.org/10.1145/2668930.2688048

work page doi:10.1145/2668930.2688048 2015

[13] [13]

The broker queue: A fast, linearizable fifo queue for fine-granular work distribution on the gpu,

B. Kerbl, M. Kenzel, J. H. Mueller, D. Schmalstieg, and M. Steinberger, “The broker queue: A fast, linearizable fifo queue for fine-granular work distribution on the gpu,” inProceedings of the 2018 International Conference on Supercomputing, ser. ICS ’18. New York, NY , USA: Association for Computing Machinery, 2018, p. 76–85. [Online]. Available: https:/...

work page doi:10.1145/3205289.3205291 2018

[14] [14]

Boundary-aware concurrent queue: A fast and scalable concurrent fifo queue on gpu environments,

M. S. H. Polak, D. A. Troendle, and B. Jang, “Boundary-aware concurrent queue: A fast and scalable concurrent fifo queue on gpu environments,”Applied Sciences, vol. 15, no. 4, 2025. [Online]. Available: https://www.mdpi.com/2076-3417/15/4/1834

2025

[15] [15]

and Wing, Jeannette M

M. P. Herlihy and J. M. Wing, “Linearizability: a correctness condition for concurrent objects,”ACM Trans. Program. Lang. Syst., vol. 12, no. 3, p. 463–492, Jul. 1990. [Online]. Available: https://doi.org/10.1145/78969.78972

work page doi:10.1145/78969.78972 1990

[16] [16]

Faster linearizability checking via p-compositionality,

A. Horn and D. Kroening, “Faster linearizability checking via p-compositionality,” inFormal Techniques for Distributed Objects, Components, and Systems, S. Graf and M. Viswanathan, Eds. Cham: Springer International Publishing, 2015, pp. 50–65. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-19195-9 4

work page doi:10.1007/978-3-319-19195-9 2015

[17] [17]

The state-of-the-art lcrq concurrent queue algorithm does not require cas2,

R. Romanov and N. Koval, “The state-of-the-art lcrq concurrent queue algorithm does not require cas2,” inProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 14–26. [Online]. Available: https://doi.org/10.1145/3572848.3577485

work page doi:10.1145/3572848.3577485 2023

[18] [18]

Gunrock: Gpu graph analytics,

Y . Wang, Y . Pan, A. Davidson, Y . Wu, C. Yang, L. Wang, M. Osama, C. Yuan, W. Liu, A. T. Riffel, and J. D. Owens, “Gunrock: Gpu graph analytics,”ACM Trans. Parallel Comput., vol. 4, no. 1, Aug. 2017. [Online]. Available: https://doi.org/10.1145/3108140

work page doi:10.1145/3108140 2017

[19] [19]

Optix: a general purpose ray tracing engine,

S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich, “Optix: a general purpose ray tracing engine,” ACM Trans. Graph., vol. 29, no. 4, Jul. 2010. [Online]. Available: https://doi.org/10.1145/1778765.1778803

work page doi:10.1145/1778765.1778803 2010

[20] [20]

Megakernels considered harmful: wavefront path tracing on gpus,

S. Laine, T. Karras, and T. Aila, “Megakernels considered harmful: wavefront path tracing on gpus,” inProceedings of the 5th High- Performance Graphics Conference, ser. HPG ’13. New York, NY , USA: Association for Computing Machinery, 2013, p. 137–143. [Online]. Available: https://doi.org/10.1145/2492045.2492060

work page doi:10.1145/2492045.2492060 2013

[21] [21]

Active thread compaction for gpu path tracing,

I. Wald, “Active thread compaction for gpu path tracing,” in Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, ser. HPG ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 51–58. [Online]. Available: https://doi.org/10.1145/2018323.2018331

work page doi:10.1145/2018323.2018331 2011

[22] [22]

On ray reordering techniques for faster gpu ray tracing,

D. Meister, J. Boksansky, M. Guthe, and J. Bittner, “On ray reordering techniques for faster gpu ray tracing,” inSymposium on Interactive 3D Graphics and Games, ser. I3D ’20. New York, NY , USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3384382.3384534

work page doi:10.1145/3384382.3384534 2020

[23] [23]

Vectorized production path tracing,

M. Lee, B. Green, F. Xie, and E. Tabellion, “Vectorized production path tracing,” inProceedings of High Performance Graphics, ser. HPG ’17. New York, NY , USA: Association for Computing Machinery,

[24] [24]

Available: https://doi.org/10.1145/3105762.3105768

[Online]. Available: https://doi.org/10.1145/3105762.3105768

work page doi:10.1145/3105762.3105768

[25] [25]

The University of Florida sparse matrix collection,

T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,”ACM Trans. Math. Softw., vol. 38, no. 1, Dec. 2011. [Online]. Available: https://doi.org/10.1145/2049662.2049663

work page doi:10.1145/2049662.2049663 2011