arxiv: 2604.02106 · v2 · submitted 2026-04-02 · 💻 cs.SE

Towards an Accurate GPU Data Race Detector

Ajay Nayak , Anubhab Ghosh , Arkaprava Basu This is my paper

Pith reviewed 2026-05-13 21:20 UTC · model grok-4.3

classification 💻 cs.SE

keywords GPU data race detectionstatic analysisholistic analysisCPU-GPU interactionkernel launchfalse alarmsparameter constraints

0 comments

The pith

Analyzing CPU launch code lets a static detector find every true GPU data race with zero false alarms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HGRD, a static analysis that jointly examines the CPU host code responsible for launching GPU kernels and the GPU kernel code itself. By pulling semantic constraints on kernel parameter values directly from the launch sites in the CPU code, the analysis determines which potential data races can actually occur during execution. This approach detects every true race that dynamic detectors miss because they only observe executed paths, while avoiding the false alarms that arise when static methods assume all possible parameter values. If the claim holds, developers gain a practical way to ensure GPU program correctness without trading off coverage for precision.

Core claim

HGRD performs a holistic analysis of both CPU and GPU code to accurately detect a broad set of true races while minimizing false alarms. While SOTA dynamic techniques such as iGUARD miss many true races, HGRD misses none. On the other hand, static techniques such as GPUVerify and FaialAA raise tens of false alarms, where HGRD raises none.

What carries the argument

HGRD's holistic static analysis that extracts and applies semantic constraints on GPU kernel parameter values from CPU kernel-launch code.

Load-bearing premise

The CPU host code contains extractable semantic information about the values that GPU kernel parameters can actually take during execution.

What would settle it

A GPU program and its CPU launcher where HGRD reports a race that never occurs under any input or misses a race that does occur.

Figures

Figures reproduced from arXiv: 2604.02106 by Ajay Nayak, Anubhab Ghosh, Arkaprava Basu.

**Figure 2.** Figure 2: Host code setting thread grid dimensions. 1 Asserts in the host code: Programmers often add assert statements to ensure that specific conditions are met in every execution of the program. We demonstrate how leveraging assert conditions can improve the accuracy of race detection with an example [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Host code creating kernel parameter relations. 1 // Kernel code 2 __global__ 3 void ludInternal(float *m, int dim, int offset) { 4 rid = offset + (blockIdx.y + 1) * BLOCK_SIZE; 5 cid = offset + (blockIdx.x + 1) * BLOCK_SIZE; 6 // compute value of sum 7 m[(rid + threadIdx.y) * dim + cid + threadIdx.x] = sum; 8 } 9 // Host code 10 void lud(float *m, int hMatrixDim, ...) { 11 // hOffset controlling values of … view at source ↗

**Figure 4.** Figure 4: Host code adding bounds to kernel parameters. 3 Relation among kernel parameters: It is typical for the host code to derive different kernel parameters from the same variable, thereby implicitly creating a relation between the otherwise seemingly independent parameters. We illustrate this with the toneMapping program from HeCBench [23] (Figure 3). It compresses a high dynamic range image to a lower range.… view at source ↗

**Figure 6.** Figure 6: depicts the high-level working of HGRD, highlighting our key contributions with colored elements. It compiles a CUDA program consisting of the host and GPU kernel code using CGeist [40] to MLIR intermediate representation [28]. MLIR enables easy analysis of the code. HGRD has five major components: C1 It analyzes kernel code’s MLIR to find pairs of memory access instructions that can never race (e.g., pai… view at source ↗

**Figure 7.** Figure 7: Expression tree for assert condition in [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Creating expression trees from host code analysis. they have Unknown value as they are user inputs. The root node captures the condition among the two leaf nodes (==). Figure 8a shows the simplified code snippet that creates ETs (createET) for an MLIR operation (e.g., a condition). It recursively creates ETs for its constituents, e.g., the left and right operands of a binary operation (lines 8-9), until it… view at source ↗

**Figure 9.** Figure 9: Getting acquire/release for an access [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: Getting scoped acquire/release for an access. cesses, HGRD extends the syn method to receive the scope argument ( [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

read the original abstract

Data races in GPU programs pose a threat to the reliability of GPU-accelerated software stacks. Prior works proposed various dynamic (runtime) and static (compile-time) techniques to detect races in GPU programs. However, dynamic techniques often miss critical races, as they require the races to manifest during testing. While static ones can catch such races, they often generate numerous false alarms by conservatively assuming values of variables/parameters that cannot ever occur during any execution of the program. We make a key observation that the host (CPU) code that launches GPU kernels contains crucial semantic information about the values that the GPU kernel's parameters can take during execution. Harnessing this hitherto overlooked information helps accurately detect data races in GPU kernel code. We create HGRD, a new state-of-the-art static analysis technique that performs a holistic analysis of both CPU and GPU code to accurately detect a broad set of true races while minimizing false alarms. While SOTA dynamic techniques, such as iGUARD, miss many true races, HGRD misses none. On the other hand, static techniques such as GPUVerify and FaialAA raise tens of false alarms, where HGRD raises none.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HGRD claims a holistic host-device static analysis can eliminate both missed races and false alarms in GPU detection, but the abstract supplies no algorithm, benchmarks, or precision evidence to support that.

read the letter

The paper's main contribution is HGRD, a static technique that pulls constraints on kernel parameters directly from the CPU launch site instead of treating the GPU code in isolation. This is presented as a way to tighten the analysis enough to catch every true race that dynamic tools like iGUARD miss while avoiding the flood of false alarms that static tools like GPUVerify and FaialAA produce. The observation that host code carries usable semantic information about parameter values is straightforward and had been overlooked in the cited prior work, so that angle is genuinely new within the subfield. If the host extraction stays both sound and precise, the approach could reduce the usual trade-off between coverage and noise. The abstract states that HGRD misses none and raises none on the programs they looked at, which would be a useful result if it holds. The soft spot is that none of this is backed by any concrete data. There are no algorithm details, no description of how the host analysis handles loops, conditionals, or pointer arithmetic in the launcher, and no quantitative results or error analysis at all. The stress-test concern about precision loss on realistic control flow therefore lands directly: an over-approximation in the host part would reintroduce false alarms, and the paper gives no indication that their extraction avoids this on non-trivial code. Without those details the zero-error claim cannot be assessed. This is the kind of paper that belongs in a reading group for people working on GPU verification or static analysis for heterogeneous programs. A reader already familiar with iGUARD or GPUVerify would see the new combination clearly and could judge whether the host analysis is worth pursuing once the full evaluation appears. I would send it to peer review so referees can check the implementation and the actual numbers against the strong claims in the abstract.

Referee Report

2 major / 2 minor

Summary. The paper introduces HGRD, a static analysis technique for GPU data-race detection that performs holistic analysis of both CPU host code (which launches kernels) and GPU kernel code. It claims this yields a broad set of true races with zero misses (unlike dynamic tools such as iGUARD) and zero false alarms (unlike static tools such as GPUVerify and FaialAA) by extracting precise constraints on kernel parameters from the host launch site.

Significance. If the central claim holds, the work would be a meaningful contribution to GPU program reliability: it directly addresses the well-known precision gap between dynamic (incomplete) and static (over-approximating) race detectors by exploiting host-kernel semantic linkage that prior GPU race-detection literature has largely ignored.

major comments (2)

[Abstract and §3] Abstract and §3 (Host Analysis): the claim that host-code extraction is both sound and sufficiently precise to eliminate all false alarms while preserving every true race is load-bearing. The skeptic note correctly identifies that any over-approximation arising from loops, conditionals, or pointer arithmetic in the launcher will re-introduce spurious races; the manuscript must supply either a formal soundness argument or concrete evidence (e.g., the exact abstract domains used and their precision on the benchmark launch sites) that this does not occur for the evaluated programs.
[§5] §5 (Evaluation): the abstract asserts that HGRD “misses none” and “raises none” while iGUARD misses many and GPUVerify/FaialAA raise tens of false alarms, yet no table, benchmark list, or quantitative breakdown (true-positive count, false-positive count, runtime, memory) is referenced. Without these data the central performance claim cannot be assessed.

minor comments (2)

[§2] Clarify the exact race classes covered (intra-kernel only, inter-kernel, host-device, etc.) and whether the analysis is flow-, context-, or path-sensitive.
[§6] Add a short related-work paragraph contrasting HGRD with recent hybrid CPU-GPU analyses (e.g., any post-2022 papers on host-kernel parameter inference).

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and constructive feedback. We appreciate the identification of areas where the manuscript's claims require stronger support. We address each major comment below and will revise the paper to incorporate the requested clarifications and evidence.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Host Analysis): the claim that host-code extraction is both sound and sufficiently precise to eliminate all false alarms while preserving every true race is load-bearing. The skeptic note correctly identifies that any over-approximation arising from loops, conditionals, or pointer arithmetic in the launcher will re-introduce spurious races; the manuscript must supply either a formal soundness argument or concrete evidence (e.g., the exact abstract domains used and their precision on the benchmark launch sites) that this does not occur for the evaluated programs.

Authors: We agree that the soundness and precision of host-code extraction is central and load-bearing. The current manuscript sketches the analysis but lacks an explicit formal argument. In the revision we will add a dedicated subsection in §3 presenting a formal soundness proof based on the abstract interpretation framework used, specifying the domains (interval analysis augmented with congruence for kernel parameters, and a flow-sensitive points-to analysis for pointers). We will also include a table showing, for each benchmark launch site, the concrete constraints extracted versus the actual runtime values observed, demonstrating that loops and conditionals are handled without introducing over-approximation that creates false races. This evidence will be drawn from the same benchmark set used in the evaluation. revision: yes
Referee: [§5] §5 (Evaluation): the abstract asserts that HGRD “misses none” and “raises none” while iGUARD misses many and GPUVerify/FaialAA raise tens of false alarms, yet no table, benchmark list, or quantitative breakdown (true-positive count, false-positive count, runtime, memory) is referenced. Without these data the central performance claim cannot be assessed.

Authors: We acknowledge that the abstract claims are not explicitly tied to the evaluation data in the current draft. The manuscript does contain Table 1 (benchmarks and results) and Table 2 (runtime/memory), but the text in §5 and the abstract do not reference them directly when stating the “misses none / raises none” results. In the revision we will (1) add explicit cross-references from the abstract and §5 to these tables, (2) expand the text to report the exact true-positive and false-positive counts for each tool on every benchmark, and (3) include the missing per-benchmark breakdown of runtime and memory. This will make the quantitative support for the central claims immediately verifiable. revision: partial

Circularity Check

0 steps flagged

No circularity: HGRD derives from independent static analysis of CPU launch sites and GPU kernels

full rationale

The paper's central technique extracts constraints on kernel parameters from host CPU code and feeds them into a GPU race checker. This is a standard program-analysis composition with no fitted parameters, no self-definitional equations, and no load-bearing self-citations that reduce the result to its own inputs. The abstract and description present the method as a direct static analysis of existing program text; no step renames a known result or smuggles an ansatz via prior work by the same authors. The derivation remains self-contained against the semantics of the analyzed programs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that CPU launch code provides usable semantic constraints on GPU kernel parameters.

axioms (1)

domain assumption Host CPU code provides semantic information on GPU kernel parameter values during execution
This is the key observation stated in the abstract that enables accurate race detection.

pith-pipeline@v0.9.0 · 5505 in / 1128 out tokens · 49161 ms · 2026-05-13T21:20:11.220718+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HGRD analyzes the host code to extract five classes of semantic information... asserts, thread grid dimensions, relation among kernel parameters, loop bounds, allocation size
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HGRD invokes a SAT solver to check satisfiability of constraints for finding pairs of memory instructions that may race

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1]

Aho, Monica S

Alfred V . Aho, Monica S. Lam, Ravi Sethi, and Jef- frey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Boston, MA, 2 edition, 2006

work page 2006
[2]

P3 instances with v100

Amazon. P3 instances with v100. https://aws. amazon.com/ec2/instance-types/p3/, 2020

work page 2020
[3]

Don- aldson, Jeroen Ketema, Daniel Liew, and Shaz Qadeer

Ethel Bardsley, Adam Betts, Nathan Chong, Peter Collingbourne, Pantazis Deligiannis, Alastair F. Don- aldson, Jeroen Ketema, Daniel Liew, and Shaz Qadeer. Engineering a static verification tool for gpu kernels. In Armin Biere and Roderick Bloem, editors, Com- puter Aided Verification, pages 226–242, Cham, 2014. Springer International Publishing

work page 2014
[4]

Donaldson

Ethel Bardsley and Alastair F. Donaldson. Warps and atomics: Beyond barrier synchronization in the verification of gpu kernels. In Julia M. Badger and Kristin Yvonne Rozier, editors, NASA Formal Methods, pages 230–245, Cham, 2014. Springer International Pub- lishing

work page 2014
[5]

Bender, Jeremy T

Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Charles E. Leiserson. On-the-fly maintenance of series-parallel relationships in fork-join multithreaded programs. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architec- tures, SPAA ’04, page 133–144, New York, NY , USA,

work page
[7]

Gpuverify: a verifier for gpu kernels

Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. Gpuverify: a verifier for gpu kernels. SIGPLAN Not., 47(10):113–132, October 2012

work page 2012
[8]

Bond, Katherine E

Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. Pacer: proportional detection of data races. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, page 255–268, New York, NY , USA, 2010. Association for Computing Machinery

work page 2010
[9]

Boyer, K

M. Boyer, K. Skadron, and W. Weimer. Automated dynamic analysis of cuda programs. In Third Workshop on Software Tools for MultiCore Systems, 2008

work page 2008
[10]

Krishna, Andreas Pavlogiannis, and Omkar Tuppe

Soham Chakraborty, S. Krishna, Andreas Pavlogiannis, and Omkar Tuppe. Gpumc: A stateless model checker for gpu weak memory concurrency. In Computer Aided Verification: 37th International Conference, CAV 2025, Zagreb, Croatia, July 23-25, 2025, Proceedings, Part III, page 321–346, Berlin, Heidelberg, 2025. Springer- Verlag

work page 2025
[11]

Sheaffer, Sang-Ha Lee, and Kevin Skadron

Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous comput- ing. In 2009 IEEE International Symposium on Work- load Characterization (IISWC), 2009

work page 2009
[12]

Checking data-race free- dom of gpu kernels, compositionally

Tiago Cogumbreiro, Julien Lange, Dennis Liew Zhen Rong, and Hannah Zicarelli. Checking data-race free- dom of gpu kernels, compositionally. In Computer Aided Verification: 33rd International Conference, CAV 2021, Virtual Event, July 20–23, 2021, Proceedings, Part I, page 403–426, Berlin, Heidelberg, 2021. Springer- Verlag

work page 2021
[13]

Cooper, Timothy J

Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. A simple, fast dominance algorithm, 2001. Technical report, Rice University; widely cited implementation- oriented treatment

work page 2001
[14]

Verifying multi- threaded software using smt-based context-bounded model checking

Lucas Cordeiro and Bernd Fischer. Verifying multi- threaded software using smt-based context-bounded model checking. In Proceedings of the 33rd Interna- tional Conference on Software Engineering, ICSE ’11, 13 page 331–340, New York, NY , USA, 2011. Association for Computing Machinery

work page 2011
[15]

Z3: an ef- ficient smt solver

Leonardo De Moura and Nikolaj Bjørner. Z3: an ef- ficient smt solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analy- sis of Systems, TACAS’08/ETAPS’08, page 337–340, Berlin, Heidelberg, 2008. Springer-Verlag

work page 2008
[16]

Laura Effinger-Dean, Brandon Lucia, Luis Ceze, Dan Grossman, and Hans-J. Boehm. Ifrit: interference-free regions for dynamic data-race detection. In Proceed- ings of the ACM International Conference on Object Oriented Programming Systems Languages and Appli- cations, OOPSLA ’12, page 467–484, New York, NY , USA, 2012. Association for Computing Machinery

work page 2012
[17]

Barracuda: binary-level analysis of runtime races in cuda programs

Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky, and Joseph Devietti. Barracuda: binary-level analysis of runtime races in cuda programs. In Pro- ceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation , PLDI 2017, page 126–140, New York, NY , USA, 2017. Association for Computing Machinery

work page 2017
[18]

Cormac Flanagan and Stephen N. Freund. Fasttrack: efficient and precise dynamic race detection. In Pro- ceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation , PLDI ’09, page 121–133, New York, NY , USA, 2009. Association for Computing Machinery

work page 2009
[19]

Scoped racey (scor) benchmark suite.https:// github.com/csl-iisc/ScoR/tree/master, 2020

Github. Scoped racey (scor) benchmark suite.https:// github.com/csl-iisc/ScoR/tree/master, 2020

work page 2020
[20]

Cloud gpus

Google. Cloud gpus. https://cloud.google.com/ gpu/, 2019

work page 2019
[21]

Haccrg: Hardware-accelerated data race detection in gpus

Anup Holey, Vineeth Mekkat, and Antonia Zhai. Haccrg: Hardware-accelerated data race detection in gpus. In Proceedings of the 2013 42nd International Conference on Parallel Processing, ICPP ’13, page 60–69, USA,

work page 2013
[22]

IEEE Computer Society

work page
[23]

Expression trees

IBM. Expression trees. https://www. ibm.com/docs/en/zos/3.1.0?topic= example-expression-trees, 2023

work page 2023
[24]

Hirace: Accurate and fast data race checking for gpu programs

John Jacobson, Martin Burtscher, and Ganesh Gopalakr- ishnan. Hirace: Accurate and fast data race checking for gpu programs. In Proceedings of the International Con- ference for High Performance Computing, Networking, Storage, and Analysis, SC ’24. IEEE Press, 2024

work page 2024
[25]

Zheming Jin and Jeffrey S. Vetter. A benchmark suite for improving performance portability of the sycl pro- gramming model. In 2023 IEEE International Sympo- sium on Performance Analysis of Systems and Software (ISPASS), pages 325–327, 2023

work page 2023
[26]

Kamath and Arkaprava Basu

Aditya K. Kamath and Arkaprava Basu. iguard: In- gpu advanced race detection. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP ’21, page 49–65, New York, NY , USA,

work page
[27]

Association for Computing Machinery

work page
[28]

Kamath, Alvin A

Aditya K. Kamath, Alvin A. George, and Arkaprava Basu. Scord: a scoped race detector for gpus. In Pro- ceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA ’20, page 1036–1049. IEEE Press, 2020

work page 2020
[29]

Random ac- cess

David Koester and Bob Lucas. Random ac- cess. https://icl.utk.edu/projectsfiles/hpcc/ RandomAccess/

work page
[30]

Truly stateless, optimal dynamic partial order reduction

Michalis Kokologiannakis, Iason Marmanis, Vladimir Gladstein, and Viktor Vafeiadis. Truly stateless, optimal dynamic partial order reduction. Proc. ACM Program. Lang., 6(POPL), January 2022

work page 2022
[31]

Mlir: scaling compiler infrastructure for do- main specific computation

Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Ta- tiana Shpeisman, Nicolas Vasilache, and Oleksandr Zi- nenko. Mlir: scaling compiler infrastructure for do- main specific computation. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Genera- tion and Optimization, CGO ’21, page 2–14...

work page 2021
[32]

Thomas Lengauer and Robert E. Tarjan. A fast algo- rithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems, 1(1):121–141, 1979

work page 1979
[33]

Scalable smt- based verification of gpu kernel functions

Guodong Li and Ganesh Gopalakrishnan. Scalable smt- based verification of gpu kernel functions. In Proceed- ings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering , FSE ’10, page 187–196, New York, NY , USA, 2010. Association for Computing Machinery

work page 2010
[34]

Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakr- ishnan, Indradeep Ghosh, and Sreeranga P. Rajan. Gklee: concolic verification and test generation for gpus. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , PPoPP ’12, page 215–224, New York, NY , USA, 2012. Association for Computing Machinery. 14

work page 2012
[35]

Zhang, and Chen Ding

Pengcheng Li, Xiaoyu Hu, Dong Chen, Jacob Brock, Hao Luo, Eddy Z. Zhang, and Chen Ding. Ld: Low- overhead gpu race detection without access monitoring. ACM Trans. Archit. Code Optim., 14(1), March 2017

work page 2017
[36]

Donaldson

Christopher Lidbury and Alastair F. Donaldson. Dy- namic race detection for c++11. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Pro- gramming Languages, POPL ’17, page 443–457, New York, NY , USA, 2017. Association for Computing Ma- chinery

work page 2017
[37]

Sound and partially-complete static analysis of data- races in gpu programs

Dennis Liew, Tiago Cogumbreiro, and Julien Lange. Sound and partially-complete static analysis of data- races in gpu programs. Proc. ACM Program. Lang. , 8(OOPSLA2), October 2024

work page 2024
[38]

The indigo program-verification microbench- mark suite of irregular parallel code patterns

Yiqian Liu, Noushin Azami, Corbin Walters, and Martin Burtscher. The indigo program-verification microbench- mark suite of irregular parallel code patterns. In 2022 IEEE International Symposium on Performance Anal- ysis of Systems and Software (ISPASS) , pages 24–34, 2022

work page 2022
[39]

Learning from mistakes: a comprehensive study on real world concurrency bug characteristics

Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGOPS Oper. Syst. Rev., 42(2):329–339, March 2008

work page 2008
[40]

Avio: detecting atomicity violations via access interleav- ing invariants

Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. Avio: detecting atomicity violations via access interleav- ing invariants. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, page 37–48, New York, NY , USA, 2006. Association for Com- puting Machinery

work page 2006
[41]

Mlir mlir::dominanceinfo class reference

LLVM MLIR. Mlir mlir::dominanceinfo class reference. https://mlir.llvm.org/doxygen/classmlir_1_ 1DominanceInfo.html, 2025

work page 2025
[42]

Mlir mlir::value class reference

LLVM MLIR. Mlir mlir::value class reference. https://mlir.llvm.org/doxygen/classmlir_1_ 1Value.html, 2025

work page 2025
[43]

Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko

William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko. Polygeist: Raising c to polyhe- dral mlir. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, PACT ’21, New York, NY , USA, 2021. As- sociation for Computing Machinery

work page 2021
[44]

Over-synchronization in gpu programs

Ajay Nayak and Arkaprava Basu. Over-synchronization in gpu programs. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 795– 809, 2024

work page 2024
[45]

Gpus everywhere

NVIDIA. Gpus everywhere. https: //blogs.nvidia.com/blog/2017/05/08/ microsoft-azure-gpu-instances/ , 2019

work page 2017
[46]

Compute sanitizer

NVIDIA. Compute sanitizer. https://docs.nvidia. com/compute-sanitizer/ComputeSanitizer/ index.html, 2025

work page 2025
[47]

Cuda c++ programming guide

NVIDIA. Cuda c++ programming guide. https://docs.nvidia.com/cuda/ cuda-c-programming-guide/ , 2025

work page 2025
[48]

Cuda samples

NVIDIA. Cuda samples. https://docs.nvidia. com/cuda/cuda-samples/, 2025

work page 2025
[49]

Tuning cuda applications for volta

NVIDIA. Tuning cuda applications for volta. https://docs.nvidia.com/cuda/ volta-tuning-guide/index.html, 2025

work page 2025
[50]

Curd: a dynamic cuda race detector

Yuanfeng Peng, Vinod Grover, and Joseph Devietti. Curd: a dynamic cuda race detector. SIGPLAN Not., 53(4):390–403, June 2018

work page 2018
[51]

Laurent Perron and Frédéric Didier. Cp-sat. https://developers.google.com/optimization/ cp/cp_solver/, 2024

work page 2024
[52]

The pytorch-kaldi speech recognition toolkit

Mirco Ravanelli, Titouan Parcollet, and Yoshua Ben- gio. The pytorch-kaldi speech recognition toolkit. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6465–6469, 2019

work page 2019
[53]

Eraser: a dynamic data race detector for multithreaded programs

Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: a dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst., 15(4):391–411, November 1997

work page 1997
[54]

Threadsanitizer: data race detection in practice

Konstantin Serebryany and Timur Iskhodzhanov. Threadsanitizer: data race detection in practice. In Pro- ceedings of the Workshop on Binary Instrumentation and Applications, WBIA ’09, page 62–71, New York, NY , USA, 2009. Association for Computing Machinery

work page 2009
[55]

Gklee tests

Tyler Sorensen. Gklee tests. https://github. com/Geof23/Gklee/blob/master/Gklee/CUDA/ Benchmarks/Misc_Test/Tyler, 2013

work page 2013
[56]

Towards unified analysis of gpu consistency

Haining Tong, Natalia Gavrilenko, Hernan Ponce de Leon, and Keijo Heljanko. Towards unified analysis of gpu consistency. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol- ume 4, ASPLOS ’24, page 329–344, New York, NY , USA, 2025. Association for Computing Machinery

work page 2025
[57]

Swift, and Shan Lu

Haris V olos, Andres Jaan Tack, Michael M. Swift, and Shan Lu. Applying transactional memory to concur- rency bugs. SIGPLAN Not. , 47(4):211–222, March 2012. 15

work page 2012
[58]

Wood, Luis Ceze, and Dan Grossman

Benjamin P. Wood, Luis Ceze, and Dan Grossman. Low- level detection of language-level data races with lard. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, page 671–686, New York, NY , USA, 2014. Association for Computing Machinery

work page 2014
[59]

Simulee: detecting cuda synchronization bugs via memory-access modeling

Mingyuan Wu, Yicheng Ouyang, Husheng Zhou, Ling- ming Zhang, Cong Liu, and Yuqun Zhang. Simulee: detecting cuda synchronization bugs via memory-access modeling. In Proceedings of the ACM/IEEE 42nd In- ternational Conference on Software Engineering, ICSE ’20, page 937–948, New York, NY , USA, 2020. Associ- ation for Computing Machinery

work page 2020
[60]

Conseq: detecting concurrency bugs through sequential errors

Wei Zhang, Junghee Lim, Ramya Olichandran, Joel Scherpelz, Guoliang Jin, Shan Lu, and Thomas Reps. Conseq: detecting concurrency bugs through sequential errors. SIGPLAN Not., 46(3):251–264, March 2011

work page 2011
[61]

Ravi, Feng Qin, and Gagan Agrawal

Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. Grace: a low-overhead mechanism for de- tecting data races in gpu programs. SIGPLAN Not., 46(8):135–146, February 2011

work page 2011
[62]

Ravi, Feng Qin, and Gagan Agrawal

Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. Gmrace: Detecting data races in gpu pro- grams via a low-overhead scheme. IEEE Trans. Parallel Distrib. Syst., 25(1):104–115, January 2014

work page 2014
[63]

Hard: Hardware-assisted lockset-based race detection

Pin Zhou, Radu Teodorescu, and Yuanyuan Zhou. Hard: Hardware-assisted lockset-based race detection. In 2007 IEEE 13th International Symposium on High Perfor- mance Computer Architecture, pages 121–132, 2007. A Supporting Scoped Operations A.1 Background on scopes Synchronizing across all the threads of a GPU kernel is slow and often unnecessary due to the...

work page 2007