arxiv: 1907.07110 · v1 · pith:3OPDBPTSnew · submitted 2019-07-15 · 💻 cs.DC

DeepRace: Finding Data Race Bugs via Deep Learning

Pith reviewed 2026-05-24 21:35 UTC · model grok-4.3

classification 💻 cs.DC
keywords data race detection · deep learning · convolutional neural network · concurrency bugs · OpenMP · POSIX · bug localization · source code classification

The pith

A one-layer CNN classifies parallel source code for data races and localizes buggy lines via activation maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepRace, a supervised CNN model trained to classify source files from OpenMP and POSIX collections as containing data race bugs. It reaches 81-86% file-level accuracy and then applies class activation mapping with global average pooling to highlight the specific buggy lines, scoring 66% IoU. The goal is to replace hand-crafted detectors with a learned classifier that operates directly on raw source code. If the approach holds, developers could obtain both file flags and line-level pointers without writing domain-specific analysis rules. The work focuses on demonstrating that the same network architecture can handle both classification and localization tasks on the given datasets.

Core claim

DeepRace trains a one-layer convolutional neural network on labeled OpenMP and POSIX source files to classify them for data race presence. It then uses class activation maps to project the weights of the final convolutional layer back onto the input, identifying the lines responsible for the race. On the test collections this yields 81-86% file accuracy, 66% line IoU, and between one and ten false positives or false negatives (the abstract does not state the unit these counts are measured over).

What carries the argument

A one-layer CNN with multiple window sizes, followed by global average pooling; class activation mapping then projects the weights of the last convolutional layer back onto the input source code to mark buggy lines.
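
As a rough illustration of this mechanism, the sketch below runs a toy 1-D convolution with three window sizes, applies global average pooling, and forms a class activation map over input positions. It is not the authors' implementation: the weights are random and untrained, and the zero padding that keeps every map the same length, the one-filter-per-window layout, the missing bias term, and the names `conv1d_same` and `forward_with_cam` are all assumptions made for the example.

```python
import random

def conv1d_same(seq, kernel):
    """ReLU of a 1-D convolution, zero-padded so the output has len(seq) entries.

    seq is a list of T embedding vectors; kernel is a list of W weight vectors.
    """
    T, W, D = len(seq), len(kernel), len(seq[0])
    left = (W - 1) // 2                      # padding on the left side
    out = []
    for t in range(T):
        acc = 0.0
        for j in range(W):
            pos = t + j - left
            if 0 <= pos < T:                 # positions outside the input count as zero
                acc += sum(seq[pos][d] * kernel[j][d] for d in range(D))
        out.append(max(acc, 0.0))            # ReLU
    return out

def forward_with_cam(seq, kernels, class_weights):
    """Return (class score, activation map over positions)."""
    fmaps = [conv1d_same(seq, k) for k in kernels]          # one map per filter
    pooled = [sum(f) / len(f) for f in fmaps]               # global average pooling
    score = sum(w * p for w, p in zip(class_weights, pooled))
    cam = [sum(w * f[t] for w, f in zip(class_weights, fmaps))
           for t in range(len(seq))]                        # weighted sum of maps
    return score, cam

random.seed(0)
T, D = 12, 4                                  # toy sizes: 12 positions, 4-dim embeddings
seq = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
kernels = [[[random.uniform(-1, 1) for _ in range(D)] for _ in range(w)]
           for w in (2, 3, 4)]                # one random filter per window size
class_weights = [random.uniform(-1, 1) for _ in kernels]

score, cam = forward_with_cam(seq, kernels, class_weights)
hottest = max(range(T), key=lambda t: cam[t])  # position the map highlights most
print(f"score={score:.3f}, hottest position={hottest}")
```

Because pooling is a plain average and the sketch has no bias, the mean of the map equals the class score. That identity is what allows a per-position map to be read off a file-level classifier.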

If this is right

  • The model classifies buggy source files at 81-86% accuracy on the OpenMP and POSIX collections.
  • It localizes buggy lines inside those files at 66% IoU while producing only 1-10 false positives or negatives.
  • It can flag multiple buggy lines at different positions within the same file.
  • The same network architecture handles both file-level classification and line-level localization without separate detectors.
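
For reference, line-level IoU and the false-positive and false-negative counts can be computed from two sets of line numbers, as sketched below. The paper's abstract does not spell out its exact formula or how empty sets are handled, so the empty-set convention, the function names, and the sample line numbers are assumptions.

```python
def line_iou(predicted, actual):
    """Intersection over union of two collections of line numbers."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        return 1.0  # assumed convention: nothing predicted, nothing buggy
    return len(predicted & actual) / len(union)

def false_counts(predicted, actual):
    """Return (false positives, false negatives) as line counts."""
    predicted, actual = set(predicted), set(actual)
    return len(predicted - actual), len(actual - predicted)

predicted_lines = {10, 11, 42}   # hypothetical lines flagged by a model
buggy_lines = {11, 42, 43}       # hypothetical ground-truth race lines

iou = line_iou(predicted_lines, buggy_lines)          # 2 shared / 4 total = 0.5
fp, fn = false_counts(predicted_lines, buggy_lines)   # (1, 1)
```

In this toy case one wrongly flagged line and one missed line already halve the score, which gives a feel for what a 66% IoU means in practice.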

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retraining the same CNN architecture on labels for other concurrency issues such as deadlocks could extend the method beyond races.
  • Embedding the model in an IDE could surface candidate race lines while a developer is editing parallel code.
  • Measuring performance on parallel code written in languages other than C/C++ would test how far the learned patterns transfer.
  • Raising line-level IoU above 66% would be required before the localization output becomes reliable enough for automated repair suggestions.

Load-bearing premise

The training labels correctly mark which files and lines contain data races and the labeled collections represent the code the model will see in practice.

What would settle it

Evaluating the trained model on a fresh collection of parallel programs whose data-race locations have been independently confirmed by an existing static or dynamic detector and checking whether file accuracy stays above 80% and line IoU stays above 60%.
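
A check of this kind could be scripted roughly as follows. The function name `replication_check`, the use of a simple mean over per-file IoU values, and the sample numbers are assumptions for illustration; the paper does not say how its IoU figure is aggregated.

```python
def replication_check(file_pred, file_true, line_ious,
                      acc_floor=0.80, iou_floor=0.60):
    """Return (accuracy, mean IoU, passed) for a fresh labeled collection."""
    if not file_true or len(file_pred) != len(file_true) or not line_ious:
        raise ValueError("need matching, non-empty predictions and labels")
    accuracy = sum(p == t for p, t in zip(file_pred, file_true)) / len(file_true)
    mean_iou = sum(line_ious) / len(line_ious)
    return accuracy, mean_iou, accuracy > acc_floor and mean_iou > iou_floor

# Hypothetical results on ten independently confirmed files.
file_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
file_pred = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]   # one mistake -> 90% accuracy
line_ious = [0.5, 0.75, 0.7, 0.65]           # per-file IoU values (made up)

accuracy, mean_iou, passed = replication_check(file_pred, file_true, line_ious)
```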

Figures

Figures reproduced from arXiv: 1907.07110 by Ali Jannesari, Ali Tehrani, Mohammed Khaleel, Reza Akbari.

Figure 1: An example of a data race in an OpenMP program (left) and its resolution via a synchronization primitive (right).
Original abstract

With the proliferation of multi-core hardware, parallel programs have become ubiquitous. These programs have their own type of bugs known as concurrency bugs and among them, data race bugs have been mostly in the focus of researchers over the past decades. In fact, detecting data races is a very challenging and important task. There have been several research paths in this area with many sophisticated tools designed and utilized that focus on detecting data race at the file level. In this paper, we propose DeepRace, a novel approach toward detecting data races in the source code. We build a deep neural network model to find data races instead of creating a data race detector manually. Our model uses a one-layer convolutional neural network (CNN) with different window size to find data races method. Then we adopt the class activation map function with global average pooling to extract the weights of the last convolutional layer and backpropagate it with the input source code to extract the line of codes with a data race. Thus, the DeepRace model can detect the data race bugs on a file and line of code level. In addition, we noticed that DeepRace successfully detects several buggy lines of code at different locations of the file. We tested the model with OpenMP and POSIX source code datasets which consist of more than 5000 and 8000 source code files respectively. We were able to successfully classify buggy source code files and achieve accuracies ranging from 81% and 86%. We also measured the performance of detecting and visualizing the data race at the line of code levels and our model achieved promising results. We only had a small number of false positives and false, ranging from 1 to 10. Furthermore, we used the intersection of union to measure the accuracy of the buggy lines of code, our model achieved promising results of 66 percent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DeepRace, a one-layer CNN model augmented with class activation mapping and global average pooling, to detect data-race bugs in source code at both file and line granularity. It reports file-level classification accuracies of 81–86% and a 66% IoU for localizing buggy lines on labeled OpenMP (>5000 files) and POSIX (>8000 files) collections, claiming the model can identify multiple buggy lines at different locations with few false positives.

Significance. If the empirical claims can be substantiated with transparent labeling protocols, reproducible splits, and appropriate baselines, the work would constitute a concrete demonstration that a lightweight CNN can perform both classification and localization of concurrency bugs on realistic code corpora. Such a result would be of interest to the concurrency-analysis community as an existence proof for learned detectors, though its practical impact would still depend on generalization beyond the training labelers.

major comments (3)
  1. [Abstract] The central numerical claims (81–86% file accuracy, 66% line IoU) are presented without any description of the labeling protocol used to mark buggy files or specific lines in the OpenMP and POSIX collections; because data-race ground truth is itself approximate, the absence of this information makes the reported metrics unverifiable and prevents assessment of whether the model has learned genuine races or merely the labeler’s biases.
  2. [Abstract] No information is supplied on train/test split methodology, baseline detectors, hyper-parameter search procedure, or statistical tests; without these standard controls the accuracy and IoU figures cannot be interpreted as evidence of superiority or even reliability.
  3. [Abstract] The supervised-learning pipeline implicitly assumes that the external labels constitute an independent and sufficiently complete oracle; the manuscript provides no evidence that the label source (static analyzer, dynamic detector, or manual annotation) was independent of the model or that coverage gaps were quantified.
minor comments (2)
  1. [Abstract] Abstract contains a typographical error: “false, ranging from 1 to 10” should read “false negatives”.
  2. [Abstract] The claim that the model “successfully detects several buggy lines of code at different locations” is stated without quantitative support beyond the aggregate IoU figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make.

  1. Referee: [Abstract] The central numerical claims (81–86% file accuracy, 66% line IoU) are presented without any description of the labeling protocol used to mark buggy files or specific lines in the OpenMP and POSIX collections; because data-race ground truth is itself approximate, the absence of this information makes the reported metrics unverifiable and prevents assessment of whether the model has learned genuine races or merely the labeler’s biases.

    Authors: We agree that the labeling protocol is necessary to interpret the results. The manuscript does not describe how the ground-truth labels were generated for the OpenMP and POSIX collections. In the revised version we will add a dedicated subsection that specifies the labeling procedure, the tools or annotators involved, and any known limitations of the ground truth. revision: yes

  2. Referee: [Abstract] No information is supplied on train/test split methodology, baseline detectors, hyper-parameter search procedure, or statistical tests; without these standard controls the accuracy and IoU figures cannot be interpreted as evidence of superiority or even reliability.

    Authors: The referee correctly notes the absence of these experimental controls. The current manuscript omits details on data splits, baselines, hyper-parameter tuning, and statistical testing. We will expand the experimental section to report the train/test split ratios and randomization method, any baseline detectors evaluated, the hyper-parameter search strategy, and the statistical tests applied to the reported accuracies and IoU scores. revision: yes

  3. Referee: [Abstract] The supervised-learning pipeline implicitly assumes that the external labels constitute an independent and sufficiently complete oracle; the manuscript provides no evidence that the label source (static analyzer, dynamic detector, or manual annotation) was independent of the model or that coverage gaps were quantified.

    Authors: We acknowledge the validity of this observation. The manuscript does not identify the label source or supply evidence of independence or coverage quantification. In the revision we will explicitly state the provenance of the labels and add a limitations paragraph discussing the independence assumption and any unquantified coverage gaps. revision: yes
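
The split and statistical-testing details promised in response 2 could take a form like the sketch below: a seeded, class-stratified train/test split plus a percentile bootstrap interval on test accuracy. None of this comes from the paper; the 80/20 ratio, the seed, the bootstrap settings, and both function names are assumptions chosen for illustration.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test, keeping the class ratio in each part."""
    rng = random.Random(seed)                # fixed seed makes the split reproducible
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * test_fraction)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)

def bootstrap_accuracy_ci(correct_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from per-file 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = sorted(
        sum(rng.choice(correct_flags) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

labels = [0] * 60 + [1] * 40                 # hypothetical: 60 clean files, 40 buggy
train_idx, test_idx = stratified_split(labels)

flags = [1] * 85 + [0] * 15                  # hypothetical: 85 of 100 test files correct
ci_low, ci_high = bootstrap_accuracy_ci(flags)
print(f"train={len(train_idx)} test={len(test_idx)} 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Reporting an interval alongside a figure such as 81-86% would show whether the gap between the two datasets exceeds sampling noise.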

Circularity Check

0 steps flagged

No circularity; standard supervised learning on externally labeled data

The paper trains a one-layer CNN on labeled OpenMP and POSIX source-code collections and reports file-level classification accuracy plus line-level IoU on the same collections. No equations, derivations, or self-citations are shown that reduce the reported metrics to quantities defined by the model itself or by prior author work. The performance numbers are ordinary empirical results of supervised training and evaluation against independent labels; the derivation chain therefore contains no load-bearing self-definition or fitted-input-called-prediction steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the provided labeled datasets are both accurate and representative, plus standard machine-learning assumptions that a CNN can extract predictive features from tokenized source code. No new physical entities or ad-hoc constants are introduced.

free parameters (2)
  • CNN window sizes
    Multiple window sizes are used in the single convolutional layer; their specific values are chosen to capture code patterns but are not reported.
  • Training hyperparameters
    Learning rate, batch size, and optimization details are required for the reported accuracies but are not stated.
axioms (2)
  • Domain assumption: labeled source-code datasets correctly mark the presence and location of data races
    The supervised training and line-level evaluation both presuppose that the ground-truth labels are reliable.
  • Domain assumption: source code can be processed as fixed-length sequences or token streams suitable for 1-D convolution
    The CNN architecture implicitly treats code this way.
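
To make the second assumption concrete, the sketch below turns a source snippet into a fixed-length sequence of integer IDs by tokenizing, truncating, and padding. The regular expression, the reserved `pad_id` and `unk_id` values, the toy vocabulary, and the `encode` helper are hypothetical; the abstract does not describe the paper's actual preprocessing.

```python
import re

# Identifiers, integer literals, or any single non-space symbol (assumed lexer).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def encode(source, vocab, max_len, pad_id=0, unk_id=1):
    """Map source text to exactly max_len token IDs (truncate, then pad)."""
    tokens = TOKEN_RE.findall(source)
    ids = [vocab.get(tok, unk_id) for tok in tokens][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

snippet = "#pragma omp parallel for\nfor (i = 0; i < n; i++) sum += a[i];"

# Toy vocabulary built from the snippet itself; IDs 0 and 1 stay reserved.
tokens = TOKEN_RE.findall(snippet)
vocab = {tok: i + 2 for i, tok in enumerate(sorted(set(tokens)))}

padded = encode(snippet, vocab, max_len=32)     # shorter than 32 tokens -> padded
truncated = encode(snippet, vocab, max_len=8)   # longer than 8 tokens -> cut off
unseen = encode("unknown_name", vocab, max_len=3)
```

Truncation silently discards everything past `max_len`, so a race that sits late in a long file would be invisible to a model fed this way. That is one reason the chosen sequence length matters for the localization results.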
