DeepRace: Finding Data Race Bugs via Deep Learning
Pith reviewed 2026-05-24 21:35 UTC · model grok-4.3
The pith
A one-layer CNN classifies parallel source code for data races and localizes buggy lines via activation maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepRace trains a one-layer convolutional neural network on labeled OpenMP and POSIX source files to classify them for data race presence, then uses class activation maps to project the weights of the final convolutional layer back onto the input source code and identify the lines responsible for the race; on the test collections this yields 81-86% file accuracy, 66% line IoU, and between one and ten false positives or false negatives.
What carries the argument
A one-layer CNN with multiple window sizes, followed by global average pooling and class activation mapping, which maps the weights of the last convolutional layer back onto the input source code to mark buggy lines.
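The mechanism can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes each source line has already been embedded as a small vector, that every filter uses zero padding so its feature map has one activation per line, and that the class score is a linear function of the globally averaged feature maps. The names `conv1d_same`, `class_activation_map`, and the toy inputs are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def conv1d_same(lines: List[Vector], kernel: List[Vector]) -> List[float]:
    """Slide one filter over the line vectors (zero padding, then ReLU).

    The output has one activation per input line, whatever the window size.
    """
    window = len(kernel)
    left = (window - 1) // 2  # lines of context before the centre line
    activations = []
    for t in range(len(lines)):
        total = 0.0
        for j, weights in enumerate(kernel):
            idx = t - left + j
            if 0 <= idx < len(lines):  # positions outside the file count as zero
                total += sum(x * w for x, w in zip(lines[idx], weights))
        activations.append(max(0.0, total))
    return activations


def class_activation_map(
    lines: List[Vector], kernels: List[List[Vector]], class_weights: List[float]
) -> Tuple[float, List[float]]:
    """Return the class score and a per-line activation map."""
    feature_maps = [conv1d_same(lines, k) for k in kernels]
    # Global average pooling: one number per filter.
    pooled = [sum(fm) / len(fm) for fm in feature_maps]
    score = sum(w * p for w, p in zip(class_weights, pooled))
    # CAM: weight each feature map by its class weight and sum per line.
    cam = [
        sum(w * fm[t] for w, fm in zip(class_weights, feature_maps))
        for t in range(len(lines))
    ]
    return score, cam


# Toy file of four lines with 2-dimensional line vectors (made-up values).
lines = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
kernels = [
    [[1.0, 0.0]],                          # window size 1
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],  # window size 3
]
class_weights = [0.5, 1.0]

score, cam = class_activation_map(lines, kernels, class_weights)
hottest_line = max(range(len(cam)), key=cam.__getitem__)
print(score, cam, hottest_line)
```

Under these assumptions the class score equals the mean of the activation map, which is why the same weights can serve both file-level classification and line-level localization.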
If this is right
- The model classifies buggy source files at 81-86% accuracy on the OpenMP and POSIX collections.
- It localizes buggy lines inside those files at 66% IoU while producing only 1-10 false positives or negatives.
- It can flag multiple buggy lines at different positions within the same file.
- The same network architecture handles both file-level classification and line-level localization without separate detectors.
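For readers unfamiliar with the line-level metric, the sketch below computes intersection over union together with false positive and false negative counts for a single file. It assumes predictions and ground truth are both sets of line numbers; how the paper aggregates these values across files is not stated in the abstract, and the function name `line_metrics` is hypothetical.

```python
from typing import Dict, Set


def line_metrics(predicted: Set[int], actual: Set[int]) -> Dict[str, float]:
    """Compare predicted buggy lines with labeled buggy lines for one file."""
    overlap = predicted & actual
    union = predicted | actual
    return {
        # Two empty sets agree perfectly, so define IoU as 1.0 in that case.
        "iou": len(overlap) / len(union) if union else 1.0,
        "false_positives": len(predicted - actual),
        "false_negatives": len(actual - predicted),
    }


# Made-up example: the model flags lines 12-14, the labels say 13-16.
metrics = line_metrics({12, 13, 14}, {13, 14, 15, 16})
print(metrics)
```

In this example two lines overlap out of five distinct lines, so the IoU is 0.4, with one false positive and two false negatives.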
Where Pith is reading between the lines
- Retraining the same CNN architecture on labels for other concurrency issues such as deadlocks could extend the method beyond races.
- Embedding the model in an IDE could surface candidate race lines while a developer is editing parallel code.
- Measuring performance on parallel code written in languages other than C/C++ would test how far the learned patterns transfer.
- Raising line-level IoU above 66% would be required before the localization output becomes reliable enough for automated repair suggestions.
Load-bearing premise
The training labels correctly mark which files and lines contain data races and the labeled collections represent the code the model will see in practice.
What would settle it
Evaluating the trained model on a fresh collection of parallel programs whose data-race locations have been independently confirmed by an existing static or dynamic detector and checking whether file accuracy stays above 80% and line IoU stays above 60%.
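The check described above can be sketched as a small acceptance test. This is an illustration only: it assumes per-file results are already available as a predicted label, an independently confirmed label, and a line IoU for files with confirmed races. The `FileResult` type, the `evaluate` function, and the sample values are hypothetical; the 0.80 and 0.60 thresholds come from the text.

```python
from typing import List, NamedTuple, Optional, Tuple


class FileResult(NamedTuple):
    predicted_racy: bool
    confirmed_racy: bool
    line_iou: Optional[float]  # None when the file has no confirmed race lines


def evaluate(
    results: List[FileResult], min_accuracy: float = 0.80, min_iou: float = 0.60
) -> Tuple[float, float, bool]:
    """Return (file accuracy, mean line IoU, whether both thresholds are exceeded)."""
    if not results:
        raise ValueError("need at least one file result")
    correct = sum(r.predicted_racy == r.confirmed_racy for r in results)
    accuracy = correct / len(results)
    ious = [r.line_iou for r in results if r.line_iou is not None]
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    return accuracy, mean_iou, accuracy > min_accuracy and mean_iou > min_iou


# Made-up results for six files.
results = [
    FileResult(True, True, 0.70),
    FileResult(True, True, 0.65),
    FileResult(True, True, 0.60),
    FileResult(False, False, None),
    FileResult(False, False, None),
    FileResult(True, False, None),  # a file-level false positive
]
accuracy, mean_iou, passed = evaluate(results)
print(accuracy, mean_iou, passed)
```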
Original abstract
With the proliferation of multi-core hardware, parallel programs have become ubiquitous. These programs have their own type of bugs known as concurrency bugs and among them, data race bugs have been mostly in the focus of researchers over the past decades. In fact, detecting data races is a very challenging and important task. There have been several research paths in this area with many sophisticated tools designed and utilized that focus on detecting data race at the file level. In this paper, we propose DeepRace, a novel approach toward detecting data races in the source code. We build a deep neural network model to find data races instead of creating a data race detector manually. Our model uses a one-layer convolutional neural network (CNN) with different window size to find data races method. Then we adopt the class activation map function with global average pooling to extract the weights of the last convolutional layer and backpropagate it with the input source code to extract the line of codes with a data race. Thus, the DeepRace model can detect the data race bugs on a file and line of code level. In addition, we noticed that DeepRace successfully detects several buggy lines of code at different locations of the file. We tested the model with OpenMP and POSIX source code datasets which consist of more than 5000 and 8000 source code files respectively. We were able to successfully classify buggy source code files and achieve accuracies ranging from 81% and 86%. We also measured the performance of detecting and visualizing the data race at the line of code levels and our model achieved promising results. We only had a small number of false positives and false, ranging from 1 to 10. Furthermore, we used the intersection of union to measure the accuracy of the buggy lines of code, our model achieved promising results of 66 percent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DeepRace, a one-layer CNN model augmented with class activation mapping and global average pooling, to detect data-race bugs in source code at both file and line granularity. It reports file-level classification accuracies of 81–86% and a 66% IoU for localizing buggy lines on labeled OpenMP (>5000 files) and POSIX (>8000 files) collections, claiming the model can identify multiple buggy lines at different locations with few false positives.
Significance. If the empirical claims can be substantiated with transparent labeling protocols, reproducible splits, and appropriate baselines, the work would constitute a concrete demonstration that a lightweight CNN can perform both classification and localization of concurrency bugs on realistic code corpora. Such a result would be of interest to the concurrency-analysis community as an existence proof for learned detectors, though its practical impact would still depend on generalization beyond the training labelers.
major comments (3)
- [Abstract] The central numerical claims (81–86% file accuracy, 66% line IoU) are presented without any description of the labeling protocol used to mark buggy files or specific lines in the OpenMP and POSIX collections; because data-race ground truth is itself approximate, the absence of this information makes the reported metrics unverifiable and prevents assessment of whether the model has learned genuine races or merely the labeler’s biases.
- [Abstract] No information is supplied on train/test split methodology, baseline detectors, hyper-parameter search procedure, or statistical tests; without these standard controls the accuracy and IoU figures cannot be interpreted as evidence of superiority or even reliability.
- [Abstract] The supervised-learning pipeline implicitly assumes that the external labels constitute an independent and sufficiently complete oracle; the manuscript provides no evidence that the label source (static analyzer, dynamic detector, or manual annotation) was independent of the model or that coverage gaps were quantified.
minor comments (2)
- [Abstract] Abstract contains a typographical error: “false, ranging from 1 to 10” should read “false negatives”.
- [Abstract] The claim that the model “successfully detects several buggy lines of code at different locations” is stated without quantitative support beyond the aggregate IoU figure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make.
-
Referee: [Abstract] The central numerical claims (81–86% file accuracy, 66% line IoU) are presented without any description of the labeling protocol used to mark buggy files or specific lines in the OpenMP and POSIX collections; because data-race ground truth is itself approximate, the absence of this information makes the reported metrics unverifiable and prevents assessment of whether the model has learned genuine races or merely the labeler’s biases.
Authors: We agree that the labeling protocol is necessary to interpret the results. The manuscript does not describe how the ground-truth labels were generated for the OpenMP and POSIX collections. In the revised version we will add a dedicated subsection that specifies the labeling procedure, the tools or annotators involved, and any known limitations of the ground truth. revision: yes
-
Referee: [Abstract] No information is supplied on train/test split methodology, baseline detectors, hyper-parameter search procedure, or statistical tests; without these standard controls the accuracy and IoU figures cannot be interpreted as evidence of superiority or even reliability.
Authors: The referee correctly notes the absence of these experimental controls. The current manuscript omits details on data splits, baselines, hyper-parameter tuning, and statistical testing. We will expand the experimental section to report the train/test split ratios and randomization method, any baseline detectors evaluated, the hyper-parameter search strategy, and the statistical tests applied to the reported accuracies and IoU scores. revision: yes
-
Referee: [Abstract] The supervised-learning pipeline implicitly assumes that the external labels constitute an independent and sufficiently complete oracle; the manuscript provides no evidence that the label source (static analyzer, dynamic detector, or manual annotation) was independent of the model or that coverage gaps were quantified.
Authors: We acknowledge the validity of this observation. The manuscript does not identify the label source or supply evidence of independence or coverage quantification. In the revision we will explicitly state the provenance of the labels and add a limitations paragraph discussing the independence assumption and any unquantified coverage gaps. revision: yes
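To make concrete what a reported "split ratio and randomization method" could look like, here is a minimal sketch of a seeded, label-stratified train/test split. It is not taken from the manuscript: the function `stratified_split`, the 20% test fraction, the seed, and the file names are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    labels: Dict[str, int], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split file names into train and test sets, preserving the label ratio."""
    by_label = defaultdict(list)
    for name in sorted(labels):  # sort first so the result depends only on the seed
        by_label[labels[name]].append(name)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        names = by_label[label]
        rng.shuffle(names)
        cut = round(len(names) * test_fraction)
        test.extend(names[:cut])
        train.extend(names[cut:])
    return train, test


# Made-up corpus: 20 files, every fourth one labeled as racy (1).
labels = {f"file_{i:02d}.c": int(i % 4 == 0) for i in range(20)}
train, test = stratified_split(labels, test_fraction=0.2, seed=42)
print(len(train), len(test), sum(labels[n] for n in test))
```

Reporting the seed and the stratification rule alongside the ratio would let a reader regenerate exactly the same partition.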
Circularity Check
No circularity; standard supervised learning on externally labeled data
The paper trains a one-layer CNN on labeled OpenMP and POSIX source-code collections and reports file-level classification accuracy plus line-level IoU on the same collections. No equations, derivations, or self-citations are shown that reduce the reported metrics to quantities defined by the model itself or by prior author work. The performance numbers are ordinary empirical results of supervised training and evaluation against independent labels; the derivation chain therefore contains no load-bearing self-definition or fitted-input-called-prediction steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- CNN window sizes
- Training hyperparameters
axioms (2)
- domain assumption: Labeled source-code datasets correctly mark the presence and location of data races
- domain assumption: Source code can be processed as fixed-length sequences or token streams suitable for 1-D convolution
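The second assumption can be illustrated with a short sketch that turns source text into a fixed-length sequence of token ids while remembering which line each token came from, so that a per-token activation can later be mapped back to a line. The regular expression, the tiny vocabulary, and the names `tokenize` and `encode` are hypothetical; the abstract does not describe the paper's actual preprocessing.

```python
import re
from typing import Dict, List, Tuple

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens


def tokenize(source: str) -> List[Tuple[str, int]]:
    """Return (token, 1-based line number) pairs in reading order."""
    pairs = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        pairs.extend((tok, line_no) for tok in TOKEN.findall(line))
    return pairs


def encode(
    source: str, vocab: Dict[str, int], length: int
) -> Tuple[List[int], List[int]]:
    """Truncate or pad to `length`; padded positions get line number 0."""
    pairs = tokenize(source)[:length]
    ids = [vocab.get(tok, UNK) for tok, _ in pairs]
    line_nos = [n for _, n in pairs]
    padding = length - len(ids)
    return ids + [PAD] * padding, line_nos + [0] * padding


source = (
    "#pragma omp parallel for\n"
    "for (i = 0; i < n; i++)\n"
    "  sum += a[i];"
)
vocab = {"for": 2, "sum": 3, "omp": 4}  # toy vocabulary
ids, line_nos = encode(source, vocab, length=32)
print(ids)
print(line_nos)
```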
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Our model uses a one-layer convolutional neural network (CNN) with different window size to find data races method. ... We tested the model with OpenMP and POSIX source code datasets ... accuracies ranging from 81% and 86%.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.