pith. machine review for the scientific record. sign in

arxiv: 2604.02106 · v2 · submitted 2026-04-02 · 💻 cs.SE

Towards an Accurate GPU Data Race Detector

Pith reviewed 2026-05-13 21:20 UTC · model grok-4.3

classification 💻 cs.SE
keywords GPU data race detectionstatic analysisholistic analysisCPU-GPU interactionkernel launchfalse alarmsparameter constraints
0
0 comments X

The pith

Analyzing CPU launch code lets a static detector find every true GPU data race with zero false alarms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HGRD, a static analysis that jointly examines the CPU host code responsible for launching GPU kernels and the GPU kernel code itself. By pulling semantic constraints on kernel parameter values directly from the launch sites in the CPU code, the analysis determines which potential data races can actually occur during execution. This approach detects every true race that dynamic detectors miss because they only observe executed paths, while avoiding the false alarms that arise when static methods assume all possible parameter values. If the claim holds, developers gain a practical way to ensure GPU program correctness without trading off coverage for precision.

Core claim

HGRD performs a holistic analysis of both CPU and GPU code to accurately detect a broad set of true races while minimizing false alarms. While SOTA dynamic techniques such as iGUARD miss many true races, HGRD misses none. On the other hand, static techniques such as GPUVerify and FaialAA raise tens of false alarms, where HGRD raises none.

What carries the argument

HGRD's holistic static analysis that extracts and applies semantic constraints on GPU kernel parameter values from CPU kernel-launch code.

Load-bearing premise

The CPU host code contains extractable semantic information about the values that GPU kernel parameters can actually take during execution.

What would settle it

A GPU program and its CPU launcher where HGRD reports a race that never occurs under any input or misses a race that does occur.

Figures

Figures reproduced from arXiv: 2604.02106 by Ajay Nayak, Anubhab Ghosh, Arkaprava Basu.

Figure 2
Figure 2. Figure 2: Host code setting thread grid dimensions. 1 Asserts in the host code: Programmers often add assert statements to ensure that specific conditions are met in every execution of the program. We demonstrate how leveraging assert conditions can improve the accuracy of race detec￾tion with an example [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Host code creating kernel parameter relations. 1 // Kernel code 2 __global__ 3 void ludInternal(float *m, int dim, int offset) { 4 rid = offset + (blockIdx.y + 1) * BLOCK_SIZE; 5 cid = offset + (blockIdx.x + 1) * BLOCK_SIZE; 6 // compute value of sum 7 m[(rid + threadIdx.y) * dim + cid + threadIdx.x] = sum; 8 } 9 // Host code 10 void lud(float *m, int hMatrixDim, ...) { 11 // hOffset controlling values of … view at source ↗
Figure 4
Figure 4. Figure 4: Host code adding bounds to kernel parameters. 3 Relation among kernel parameters: It is typical for the host code to derive different kernel parameters from the same variable, thereby implicitly creating a relation between the otherwise seemingly independent parameters. We illustrate this with the toneMapping program from HeCBench [23] (Fig￾ure 3). It compresses a high dynamic range image to a lower range.… view at source ↗
Figure 6
Figure 6. Figure 6: depicts the high-level working of HGRD, highlight￾ing our key contributions with colored elements. It compiles a CUDA program consisting of the host and GPU kernel code using CGeist [40] to MLIR intermediate representation [28]. MLIR enables easy analysis of the code. HGRD has five major components: C1 It analyzes kernel code’s MLIR to find pairs of memory access instructions that can never race (e.g., pai… view at source ↗
Figure 7
Figure 7. Figure 7: Expression tree for assert condition in [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Creating expression trees from host code analysis. they have Unknown value as they are user inputs. The root node captures the condition among the two leaf nodes (==). Figure 8a shows the simplified code snippet that creates ETs (createET) for an MLIR operation (e.g., a condition). It recursively creates ETs for its constituents, e.g., the left and right operands of a binary operation (lines 8-9), until it… view at source ↗
Figure 9
Figure 9. Figure 9: Getting acquire/release for an access [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Getting scoped acquire/release for an access. cesses, HGRD extends the syn method to receive the scope argument ( [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
read the original abstract

Data races in GPU programs pose a threat to the reliability of GPU-accelerated software stacks. Prior works proposed various dynamic (runtime) and static (compile-time) techniques to detect races in GPU programs. However, dynamic techniques often miss critical races, as they require the races to manifest during testing. While static ones can catch such races, they often generate numerous false alarms by conservatively assuming values of variables/parameters that cannot ever occur during any execution of the program. We make a key observation that the host (CPU) code that launches GPU kernels contains crucial semantic information about the values that the GPU kernel's parameters can take during execution. Harnessing this hitherto overlooked information helps accurately detect data races in GPU kernel code. We create HGRD, a new state-of-the-art static analysis technique that performs a holistic analysis of both CPU and GPU code to accurately detect a broad set of true races while minimizing false alarms. While SOTA dynamic techniques, such as iGUARD, miss many true races, HGRD misses none. On the other hand, static techniques such as GPUVerify and FaialAA raise tens of false alarms, where HGRD raises none.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HGRD, a static analysis technique for GPU data-race detection that performs holistic analysis of both CPU host code (which launches kernels) and GPU kernel code. It claims this yields a broad set of true races with zero misses (unlike dynamic tools such as iGUARD) and zero false alarms (unlike static tools such as GPUVerify and FaialAA) by extracting precise constraints on kernel parameters from the host launch site.

Significance. If the central claim holds, the work would be a meaningful contribution to GPU program reliability: it directly addresses the well-known precision gap between dynamic (incomplete) and static (over-approximating) race detectors by exploiting host-kernel semantic linkage that prior GPU race-detection literature has largely ignored.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Host Analysis): the claim that host-code extraction is both sound and sufficiently precise to eliminate all false alarms while preserving every true race is load-bearing. The skeptic note correctly identifies that any over-approximation arising from loops, conditionals, or pointer arithmetic in the launcher will re-introduce spurious races; the manuscript must supply either a formal soundness argument or concrete evidence (e.g., the exact abstract domains used and their precision on the benchmark launch sites) that this does not occur for the evaluated programs.
  2. [§5] §5 (Evaluation): the abstract asserts that HGRD “misses none” and “raises none” while iGUARD misses many and GPUVerify/FaialAA raise tens of false alarms, yet no table, benchmark list, or quantitative breakdown (true-positive count, false-positive count, runtime, memory) is referenced. Without these data the central performance claim cannot be assessed.
minor comments (2)
  1. [§2] Clarify the exact race classes covered (intra-kernel only, inter-kernel, host-device, etc.) and whether the analysis is flow-, context-, or path-sensitive.
  2. [§6] Add a short related-work paragraph contrasting HGRD with recent hybrid CPU-GPU analyses (e.g., any post-2022 papers on host-kernel parameter inference).

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and constructive feedback. We appreciate the identification of areas where the manuscript's claims require stronger support. We address each major comment below and will revise the paper to incorporate the requested clarifications and evidence.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Host Analysis): the claim that host-code extraction is both sound and sufficiently precise to eliminate all false alarms while preserving every true race is load-bearing. The skeptic note correctly identifies that any over-approximation arising from loops, conditionals, or pointer arithmetic in the launcher will re-introduce spurious races; the manuscript must supply either a formal soundness argument or concrete evidence (e.g., the exact abstract domains used and their precision on the benchmark launch sites) that this does not occur for the evaluated programs.

    Authors: We agree that the soundness and precision of host-code extraction is central and load-bearing. The current manuscript sketches the analysis but lacks an explicit formal argument. In the revision we will add a dedicated subsection in §3 presenting a formal soundness proof based on the abstract interpretation framework used, specifying the domains (interval analysis augmented with congruence for kernel parameters, and a flow-sensitive points-to analysis for pointers). We will also include a table showing, for each benchmark launch site, the concrete constraints extracted versus the actual runtime values observed, demonstrating that loops and conditionals are handled without introducing over-approximation that creates false races. This evidence will be drawn from the same benchmark set used in the evaluation. revision: yes

  2. Referee: [§5] §5 (Evaluation): the abstract asserts that HGRD “misses none” and “raises none” while iGUARD misses many and GPUVerify/FaialAA raise tens of false alarms, yet no table, benchmark list, or quantitative breakdown (true-positive count, false-positive count, runtime, memory) is referenced. Without these data the central performance claim cannot be assessed.

    Authors: We acknowledge that the abstract claims are not explicitly tied to the evaluation data in the current draft. The manuscript does contain Table 1 (benchmarks and results) and Table 2 (runtime/memory), but the text in §5 and the abstract do not reference them directly when stating the “misses none / raises none” results. In the revision we will (1) add explicit cross-references from the abstract and §5 to these tables, (2) expand the text to report the exact true-positive and false-positive counts for each tool on every benchmark, and (3) include the missing per-benchmark breakdown of runtime and memory. This will make the quantitative support for the central claims immediately verifiable. revision: partial

Circularity Check

0 steps flagged

No circularity: HGRD derives from independent static analysis of CPU launch sites and GPU kernels

full rationale

The paper's central technique extracts constraints on kernel parameters from host CPU code and feeds them into a GPU race checker. This is a standard program-analysis composition with no fitted parameters, no self-definitional equations, and no load-bearing self-citations that reduce the result to its own inputs. The abstract and description present the method as a direct static analysis of existing program text; no step renames a known result or smuggles an ansatz via prior work by the same authors. The derivation remains self-contained against the semantics of the analyzed programs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that CPU launch code provides usable semantic constraints on GPU kernel parameters.

axioms (1)
  • domain assumption Host CPU code provides semantic information on GPU kernel parameter values during execution
    This is the key observation stated in the abstract that enables accurate race detection.

pith-pipeline@v0.9.0 · 5505 in / 1128 out tokens · 49161 ms · 2026-05-13T21:20:11.220718+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

  1. [1]

    Aho, Monica S

    Alfred V . Aho, Monica S. Lam, Ravi Sethi, and Jef- frey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Boston, MA, 2 edition, 2006

  2. [2]

    P3 instances with v100

    Amazon. P3 instances with v100. https://aws. amazon.com/ec2/instance-types/p3/, 2020

  3. [3]

    Don- aldson, Jeroen Ketema, Daniel Liew, and Shaz Qadeer

    Ethel Bardsley, Adam Betts, Nathan Chong, Peter Collingbourne, Pantazis Deligiannis, Alastair F. Don- aldson, Jeroen Ketema, Daniel Liew, and Shaz Qadeer. Engineering a static verification tool for gpu kernels. In Armin Biere and Roderick Bloem, editors, Com- puter Aided Verification, pages 226–242, Cham, 2014. Springer International Publishing

  4. [4]

    Donaldson

    Ethel Bardsley and Alastair F. Donaldson. Warps and atomics: Beyond barrier synchronization in the verification of gpu kernels. In Julia M. Badger and Kristin Yvonne Rozier, editors, NASA Formal Methods, pages 230–245, Cham, 2014. Springer International Pub- lishing

  5. [5]

    Bender, Jeremy T

    Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Charles E. Leiserson. On-the-fly maintenance of series-parallel relationships in fork-join multithreaded programs. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architec- tures, SPAA ’04, page 133–144, New York, NY , USA,

  6. [7]

    Gpuverify: a verifier for gpu kernels

    Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. Gpuverify: a verifier for gpu kernels. SIGPLAN Not., 47(10):113–132, October 2012

  7. [8]

    Bond, Katherine E

    Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. Pacer: proportional detection of data races. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, page 255–268, New York, NY , USA, 2010. Association for Computing Machinery

  8. [9]

    Boyer, K

    M. Boyer, K. Skadron, and W. Weimer. Automated dynamic analysis of cuda programs. In Third Workshop on Software Tools for MultiCore Systems, 2008

  9. [10]

    Krishna, Andreas Pavlogiannis, and Omkar Tuppe

    Soham Chakraborty, S. Krishna, Andreas Pavlogiannis, and Omkar Tuppe. Gpumc: A stateless model checker for gpu weak memory concurrency. In Computer Aided Verification: 37th International Conference, CAV 2025, Zagreb, Croatia, July 23-25, 2025, Proceedings, Part III, page 321–346, Berlin, Heidelberg, 2025. Springer- Verlag

  10. [11]

    Sheaffer, Sang-Ha Lee, and Kevin Skadron

    Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous comput- ing. In 2009 IEEE International Symposium on Work- load Characterization (IISWC), 2009

  11. [12]

    Checking data-race free- dom of gpu kernels, compositionally

    Tiago Cogumbreiro, Julien Lange, Dennis Liew Zhen Rong, and Hannah Zicarelli. Checking data-race free- dom of gpu kernels, compositionally. In Computer Aided Verification: 33rd International Conference, CAV 2021, Virtual Event, July 20–23, 2021, Proceedings, Part I, page 403–426, Berlin, Heidelberg, 2021. Springer- Verlag

  12. [13]

    Cooper, Timothy J

    Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. A simple, fast dominance algorithm, 2001. Technical report, Rice University; widely cited implementation- oriented treatment

  13. [14]

    Verifying multi- threaded software using smt-based context-bounded model checking

    Lucas Cordeiro and Bernd Fischer. Verifying multi- threaded software using smt-based context-bounded model checking. In Proceedings of the 33rd Interna- tional Conference on Software Engineering, ICSE ’11, 13 page 331–340, New York, NY , USA, 2011. Association for Computing Machinery

  14. [15]

    Z3: an ef- ficient smt solver

    Leonardo De Moura and Nikolaj Bjørner. Z3: an ef- ficient smt solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analy- sis of Systems, TACAS’08/ETAPS’08, page 337–340, Berlin, Heidelberg, 2008. Springer-Verlag

  15. [16]

    Laura Effinger-Dean, Brandon Lucia, Luis Ceze, Dan Grossman, and Hans-J. Boehm. Ifrit: interference-free regions for dynamic data-race detection. In Proceed- ings of the ACM International Conference on Object Oriented Programming Systems Languages and Appli- cations, OOPSLA ’12, page 467–484, New York, NY , USA, 2012. Association for Computing Machinery

  16. [17]

    Barracuda: binary-level analysis of runtime races in cuda programs

    Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky, and Joseph Devietti. Barracuda: binary-level analysis of runtime races in cuda programs. In Pro- ceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation , PLDI 2017, page 126–140, New York, NY , USA, 2017. Association for Computing Machinery

  17. [18]

    Cormac Flanagan and Stephen N. Freund. Fasttrack: efficient and precise dynamic race detection. In Pro- ceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation , PLDI ’09, page 121–133, New York, NY , USA, 2009. Association for Computing Machinery

  18. [19]

    Scoped racey (scor) benchmark suite.https:// github.com/csl-iisc/ScoR/tree/master, 2020

    Github. Scoped racey (scor) benchmark suite.https:// github.com/csl-iisc/ScoR/tree/master, 2020

  19. [20]

    Cloud gpus

    Google. Cloud gpus. https://cloud.google.com/ gpu/, 2019

  20. [21]

    Haccrg: Hardware-accelerated data race detection in gpus

    Anup Holey, Vineeth Mekkat, and Antonia Zhai. Haccrg: Hardware-accelerated data race detection in gpus. In Proceedings of the 2013 42nd International Conference on Parallel Processing, ICPP ’13, page 60–69, USA,

  21. [22]

    IEEE Computer Society

  22. [23]

    Expression trees

    IBM. Expression trees. https://www. ibm.com/docs/en/zos/3.1.0?topic= example-expression-trees, 2023

  23. [24]

    Hirace: Accurate and fast data race checking for gpu programs

    John Jacobson, Martin Burtscher, and Ganesh Gopalakr- ishnan. Hirace: Accurate and fast data race checking for gpu programs. In Proceedings of the International Con- ference for High Performance Computing, Networking, Storage, and Analysis, SC ’24. IEEE Press, 2024

  24. [25]

    Zheming Jin and Jeffrey S. Vetter. A benchmark suite for improving performance portability of the sycl pro- gramming model. In 2023 IEEE International Sympo- sium on Performance Analysis of Systems and Software (ISPASS), pages 325–327, 2023

  25. [26]

    Kamath and Arkaprava Basu

    Aditya K. Kamath and Arkaprava Basu. iguard: In- gpu advanced race detection. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP ’21, page 49–65, New York, NY , USA,

  26. [27]

    Association for Computing Machinery

  27. [28]

    Kamath, Alvin A

    Aditya K. Kamath, Alvin A. George, and Arkaprava Basu. Scord: a scoped race detector for gpus. In Pro- ceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA ’20, page 1036–1049. IEEE Press, 2020

  28. [29]

    Random ac- cess

    David Koester and Bob Lucas. Random ac- cess. https://icl.utk.edu/projectsfiles/hpcc/ RandomAccess/

  29. [30]

    Truly stateless, optimal dynamic partial order reduction

    Michalis Kokologiannakis, Iason Marmanis, Vladimir Gladstein, and Viktor Vafeiadis. Truly stateless, optimal dynamic partial order reduction. Proc. ACM Program. Lang., 6(POPL), January 2022

  30. [31]

    Mlir: scaling compiler infrastructure for do- main specific computation

    Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Ta- tiana Shpeisman, Nicolas Vasilache, and Oleksandr Zi- nenko. Mlir: scaling compiler infrastructure for do- main specific computation. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Genera- tion and Optimization, CGO ’21, page 2–14...

  31. [32]

    Thomas Lengauer and Robert E. Tarjan. A fast algo- rithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems, 1(1):121–141, 1979

  32. [33]

    Scalable smt- based verification of gpu kernel functions

    Guodong Li and Ganesh Gopalakrishnan. Scalable smt- based verification of gpu kernel functions. In Proceed- ings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering , FSE ’10, page 187–196, New York, NY , USA, 2010. Association for Computing Machinery

  33. [34]

    Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakr- ishnan, Indradeep Ghosh, and Sreeranga P. Rajan. Gklee: concolic verification and test generation for gpus. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , PPoPP ’12, page 215–224, New York, NY , USA, 2012. Association for Computing Machinery. 14

  34. [35]

    Zhang, and Chen Ding

    Pengcheng Li, Xiaoyu Hu, Dong Chen, Jacob Brock, Hao Luo, Eddy Z. Zhang, and Chen Ding. Ld: Low- overhead gpu race detection without access monitoring. ACM Trans. Archit. Code Optim., 14(1), March 2017

  35. [36]

    Donaldson

    Christopher Lidbury and Alastair F. Donaldson. Dy- namic race detection for c++11. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Pro- gramming Languages, POPL ’17, page 443–457, New York, NY , USA, 2017. Association for Computing Ma- chinery

  36. [37]

    Sound and partially-complete static analysis of data- races in gpu programs

    Dennis Liew, Tiago Cogumbreiro, and Julien Lange. Sound and partially-complete static analysis of data- races in gpu programs. Proc. ACM Program. Lang. , 8(OOPSLA2), October 2024

  37. [38]

    The indigo program-verification microbench- mark suite of irregular parallel code patterns

    Yiqian Liu, Noushin Azami, Corbin Walters, and Martin Burtscher. The indigo program-verification microbench- mark suite of irregular parallel code patterns. In 2022 IEEE International Symposium on Performance Anal- ysis of Systems and Software (ISPASS) , pages 24–34, 2022

  38. [39]

    Learning from mistakes: a comprehensive study on real world concurrency bug characteristics

    Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGOPS Oper. Syst. Rev., 42(2):329–339, March 2008

  39. [40]

    Avio: detecting atomicity violations via access interleav- ing invariants

    Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. Avio: detecting atomicity violations via access interleav- ing invariants. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, page 37–48, New York, NY , USA, 2006. Association for Com- puting Machinery

  40. [41]

    Mlir mlir::dominanceinfo class reference

    LLVM MLIR. Mlir mlir::dominanceinfo class reference. https://mlir.llvm.org/doxygen/classmlir_1_ 1DominanceInfo.html, 2025

  41. [42]

    Mlir mlir::value class reference

    LLVM MLIR. Mlir mlir::value class reference. https://mlir.llvm.org/doxygen/classmlir_1_ 1Value.html, 2025

  42. [43]

    Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko

    William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko. Polygeist: Raising c to polyhe- dral mlir. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, PACT ’21, New York, NY , USA, 2021. As- sociation for Computing Machinery

  43. [44]

    Over-synchronization in gpu programs

    Ajay Nayak and Arkaprava Basu. Over-synchronization in gpu programs. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 795– 809, 2024

  44. [45]

    Gpus everywhere

    NVIDIA. Gpus everywhere. https: //blogs.nvidia.com/blog/2017/05/08/ microsoft-azure-gpu-instances/ , 2019

  45. [46]

    Compute sanitizer

    NVIDIA. Compute sanitizer. https://docs.nvidia. com/compute-sanitizer/ComputeSanitizer/ index.html, 2025

  46. [47]

    Cuda c++ programming guide

    NVIDIA. Cuda c++ programming guide. https://docs.nvidia.com/cuda/ cuda-c-programming-guide/ , 2025

  47. [48]

    Cuda samples

    NVIDIA. Cuda samples. https://docs.nvidia. com/cuda/cuda-samples/, 2025

  48. [49]

    Tuning cuda applications for volta

    NVIDIA. Tuning cuda applications for volta. https://docs.nvidia.com/cuda/ volta-tuning-guide/index.html, 2025

  49. [50]

    Curd: a dynamic cuda race detector

    Yuanfeng Peng, Vinod Grover, and Joseph Devietti. Curd: a dynamic cuda race detector. SIGPLAN Not., 53(4):390–403, June 2018

  50. [51]

    Laurent Perron and Frédéric Didier. Cp-sat. https://developers.google.com/optimization/ cp/cp_solver/, 2024

  51. [52]

    The pytorch-kaldi speech recognition toolkit

    Mirco Ravanelli, Titouan Parcollet, and Yoshua Ben- gio. The pytorch-kaldi speech recognition toolkit. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6465–6469, 2019

  52. [53]

    Eraser: a dynamic data race detector for multithreaded programs

    Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: a dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst., 15(4):391–411, November 1997

  53. [54]

    Threadsanitizer: data race detection in practice

    Konstantin Serebryany and Timur Iskhodzhanov. Threadsanitizer: data race detection in practice. In Pro- ceedings of the Workshop on Binary Instrumentation and Applications, WBIA ’09, page 62–71, New York, NY , USA, 2009. Association for Computing Machinery

  54. [55]

    Gklee tests

    Tyler Sorensen. Gklee tests. https://github. com/Geof23/Gklee/blob/master/Gklee/CUDA/ Benchmarks/Misc_Test/Tyler, 2013

  55. [56]

    Towards unified analysis of gpu consistency

    Haining Tong, Natalia Gavrilenko, Hernan Ponce de Leon, and Keijo Heljanko. Towards unified analysis of gpu consistency. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol- ume 4, ASPLOS ’24, page 329–344, New York, NY , USA, 2025. Association for Computing Machinery

  56. [57]

    Swift, and Shan Lu

    Haris V olos, Andres Jaan Tack, Michael M. Swift, and Shan Lu. Applying transactional memory to concur- rency bugs. SIGPLAN Not. , 47(4):211–222, March 2012. 15

  57. [58]

    Wood, Luis Ceze, and Dan Grossman

    Benjamin P. Wood, Luis Ceze, and Dan Grossman. Low- level detection of language-level data races with lard. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, page 671–686, New York, NY , USA, 2014. Association for Computing Machinery

  58. [59]

    Simulee: detecting cuda synchronization bugs via memory-access modeling

    Mingyuan Wu, Yicheng Ouyang, Husheng Zhou, Ling- ming Zhang, Cong Liu, and Yuqun Zhang. Simulee: detecting cuda synchronization bugs via memory-access modeling. In Proceedings of the ACM/IEEE 42nd In- ternational Conference on Software Engineering, ICSE ’20, page 937–948, New York, NY , USA, 2020. Associ- ation for Computing Machinery

  59. [60]

    Conseq: detecting concurrency bugs through sequential errors

    Wei Zhang, Junghee Lim, Ramya Olichandran, Joel Scherpelz, Guoliang Jin, Shan Lu, and Thomas Reps. Conseq: detecting concurrency bugs through sequential errors. SIGPLAN Not., 46(3):251–264, March 2011

  60. [61]

    Ravi, Feng Qin, and Gagan Agrawal

    Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. Grace: a low-overhead mechanism for de- tecting data races in gpu programs. SIGPLAN Not., 46(8):135–146, February 2011

  61. [62]

    Ravi, Feng Qin, and Gagan Agrawal

    Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. Gmrace: Detecting data races in gpu pro- grams via a low-overhead scheme. IEEE Trans. Parallel Distrib. Syst., 25(1):104–115, January 2014

  62. [63]

    Hard: Hardware-assisted lockset-based race detection

    Pin Zhou, Radu Teodorescu, and Yuanyuan Zhou. Hard: Hardware-assisted lockset-based race detection. In 2007 IEEE 13th International Symposium on High Perfor- mance Computer Architecture, pages 121–132, 2007. A Supporting Scoped Operations A.1 Background on scopes Synchronizing across all the threads of a GPU kernel is slow and often unnecessary due to the...