pith. sign in

arxiv: 2508.16419 · v2 · submitted 2025-08-22 · 💻 cs.SE · cs.LG

Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++

Pith reviewed 2026-05-18 21:37 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords LLM evaluationbug detectionsecurity vulnerabilitiesPythonC++code auditingempirical studysoftware bugs
0
0 comments X

The pith

LLMs detect basic syntactic and semantic bugs in small code well but lose effectiveness on complex security vulnerabilities in large production programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three leading LLMs on their ability to find bugs in Python and C++ ranging from simple beginner mistakes to advanced security issues. It draws on real code samples from educational labs, OpenSSL, and bug repositories, then applies a multi-stage prompting method that mimics realistic debugging. All models perform strongly when the code is limited in scope and the errors are syntactic or semantic, which supports their use in teaching or as quick first checks. Detection quality and reasoning depth fall when the task involves intricate security flaws or bigger codebases, with two of the models delivering more detailed context than the third.

Core claim

The authors establish that LLMs can effectively identify syntactic and semantic bugs in well-scoped code from Python and C++, making them useful for education and as initial reviewers in code auditing processes. Their ability to handle complex security vulnerabilities decreases when dealing with large-scale production code, although ChatGPT-4 and Claude 3 tend to offer more detailed contextual insights than LLaMA 4.

What carries the argument

Multi-stage context-aware prompting protocol applied to a benchmark of bugs from SEED Labs, OpenSSL via Suresoft GLaDOS, and PyBugHive, evaluated with a graded rubric for detection accuracy, reasoning depth, and remediation suggestions.

If this is right

  • LLMs can serve as practical aids in educational settings for helping students spot and understand basic coding errors.
  • They function effectively as first-pass reviewers inside automated code auditing pipelines for well-scoped programs.
  • Human oversight or complementary tools remain necessary when auditing code that contains complex security vulnerabilities.
  • Differences among models indicate that selection of a particular LLM can improve the quality of contextual bug analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding these LLMs into development environments could accelerate routine debugging for small to medium projects.
  • The observed performance gap on large codebases points toward the value of pairing LLMs with static analysis engines.
  • Fine-tuning future models on broader collections of real security vulnerabilities might narrow the current limitations.
  • Teams working on production systems could adopt a tiered approach that routes simple checks to LLMs and reserves experts for high-risk sections.

Load-bearing premise

The chosen datasets from SEED Labs, OpenSSL, and PyBugHive together with the multi-stage prompting protocol give a representative and unbiased test of LLM performance across bug types and code scales.

What would settle it

A replication that finds comparable detection rates for advanced security vulnerabilities as for basic syntactic errors across the same benchmark datasets would contradict the reported drop in performance.

Figures

Figures reproduced from arXiv: 2508.16419 by Akshay Mhatre, Deepti Gupta, Noujoud Nader, Patrick Diehl.

Figure 1
Figure 1. Figure 1: Effort and Quality of the bug detection for (a) in Section IV-A, (b) in Section V, and (c) in Section V. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Advanced Python bugs In conclusion, LLMs are effective in identifying a broad class of software bugs, particularly those that are syntactic, shallow, or pedagogically motivated. They offer substantial promise in educational tools, static analysis assistance, and as first-pass reviewers in code auditing workflows. However, for more complex tasks, their current capabilities remain limited. Future Work. We wi… view at source ↗
read the original abstract

Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development, supporting tasks from code generation to debugging. Yet, their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. This highlights both the promise and the present constraints of LLMs in serving as reliable code analysis tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide a systematic empirical evaluation of LLMs (ChatGPT-4, Claude 3, LLaMA 4) for bug detection in Python and C++ code, spanning beginner errors to security vulnerabilities. Using integrated datasets from SEED Labs, OpenSSL (Suresoft GLaDOS), and PyBugHive with local validation pipelines, and a novel multi-stage context-aware prompting protocol with a graded rubric, it finds that models excel at syntactic and semantic issues in well-scoped code but performance drops for complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 offering more nuanced analyses than LLaMA 4.

Significance. If the central results hold after addressing dataset characterization issues, the work would be significant for the software engineering community by providing evidence-based guidance on LLM deployment in code review and education. It highlights both capabilities and limitations, encouraging further research into improving LLM performance on complex, real-world code. The reproducible validation pipelines and rubric are positive aspects for future benchmarking studies.

major comments (2)
  1. [§3 (Dataset and Methodology)] The headline claim of diminished performance on large-scale production code (abstract and §5) is load-bearing for the paper's conclusions about LLM constraints. However, the selected samples from OpenSSL and PyBugHive are not characterized quantitatively (e.g., no LOC distributions, cyclomatic complexity metrics, or dependency graphs provided), nor benchmarked against typical production codebases. This raises the possibility that the performance gap reflects dataset properties rather than general LLM limitations.
  2. [§5 (Results)] The results section reports performance metrics across bug categories and models, but lacks statistical details such as significance tests, confidence intervals, or full tabulated breakdowns of the graded rubric scores. This weakens the ability to substantiate claims of 'excel' versus 'diminishes' and the relative superiority of ChatGPT-4 and Claude 3.
minor comments (2)
  1. [Abstract] Clarify the exact model versions used, particularly 'LLaMA 4', to avoid ambiguity with publicly available releases.
  2. [§4 (Prompting Protocol)] The description of the multi-stage prompting protocol would benefit from an explicit example prompt template in an appendix to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which will help strengthen the empirical rigor of our evaluation. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [§3 (Dataset and Methodology)] The headline claim of diminished performance on large-scale production code (abstract and §5) is load-bearing for the paper's conclusions about LLM constraints. However, the selected samples from OpenSSL and PyBugHive are not characterized quantitatively (e.g., no LOC distributions, cyclomatic complexity metrics, or dependency graphs provided), nor benchmarked against typical production codebases. This raises the possibility that the performance gap reflects dataset properties rather than general LLM limitations.

    Authors: We agree that quantitative characterization of the code samples is necessary to support claims about performance on large-scale production code. In the revised manuscript we will add LOC distributions, cyclomatic complexity metrics, and basic dependency summaries for the OpenSSL and PyBugHive subsets. We will also include a brief comparison against publicly reported statistics for typical production codebases to help readers assess whether the observed gaps are dataset-specific or reflect broader LLM limitations. revision: yes

  2. Referee: [§5 (Results)] The results section reports performance metrics across bug categories and models, but lacks statistical details such as significance tests, confidence intervals, or full tabulated breakdowns of the graded rubric scores. This weakens the ability to substantiate claims of 'excel' versus 'diminishes' and the relative superiority of ChatGPT-4 and Claude 3.

    Authors: We acknowledge that the current presentation would benefit from additional statistical support. In the revision we will add bootstrap confidence intervals for the main accuracy metrics, report the results of appropriate paired statistical tests (e.g., McNemar or Wilcoxon signed-rank) for model and category comparisons, and include a complete tabulated breakdown of the graded rubric scores either in the main text or as an appendix. These additions will provide a more rigorous basis for the distinctions drawn between model capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmarking against external ground-truth datasets

full rationale

This is a straightforward empirical evaluation study that measures LLM performance on bug detection using pre-existing external datasets (SEED Labs, Suresoft GLaDOS OpenSSL, PyBugHive) with ground-truth labels obtained via compilation and testing. No mathematical derivations, equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the abstract or described methods. The results consist of measured accuracies and qualitative assessments against independent benchmarks, with no reduction of claims to self-defined quantities or ansatzes. The study is therefore self-contained against external data and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard empirical evaluation methods and existing public benchmarks without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5780 in / 1269 out tokens · 73241 ms · 2026-05-18T21:37:34.317796+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can LLMs Make (Personalized) Access Control Decisions?

    cs.CR 2025-11 unverdicted novelty 5.0

    LLMs reflect users' privacy preferences in access control decisions with up to 86% agreement and can promote safer behavior, but personalization trades off higher individual match for potentially less secure results w...

  2. An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code

    cs.SE 2026-04 unverdicted novelty 4.0

    Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [1]

    A survey on evaluation of large language models,

    Y . Chang et al., “A survey on evaluation of large language models,”ACM Transactions on Intelligent Systems and Technology , vol. 15, no. 3, pp. 1–45, 2024

  2. [2]

    GPT-4 Technical Report

    J. Achiam et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Claude — claude.ai,

    Claude, “Claude — claude.ai,” https://claude.ai/new, [Accessed 11-08- 2025]

  4. [4]

    Github copilot ai pair programmer: Asset or liability?

    A. M. Dakhel et al. , “Github copilot ai pair programmer: Asset or liability?” Journal of Systems and Software , vol. 203, p. 111734, 2023

  5. [5]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971 , 2023

  6. [6]

    BERT: a review of applications in natural language processing and understanding,

    M. V . Koroteev, “BERT: a review of applications in natural language processing and understanding,” arXiv preprint arXiv:2103.11943, 2021

  7. [7]

    Evaluating AI-generated code for C ++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust,

    P. Diehl et al. , “Evaluating AI-generated code for C ++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust,” arXiv preprint arXiv:2405.13101, 2024

  8. [8]

    LLM & HPC: Benchmarking DeepSeek’s Performance in High–Performance Computing Tasks,

    N. Nader et al. , “LLM & HPC: Benchmarking DeepSeek’s Performance in High–Performance Computing Tasks,” arXiv preprint arXiv:2504.03665 , 2025. [Online]. Available: https: //arxiv.org/abs/2504.03665

  9. [9]

    LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages,

    P. Diehl et al. , “LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages,” Journal of Machine Learning for Modeling and Computing, vol. 6, no. 3, 2025

  10. [10]

    An empirical study on low code programming us- ing traditional vs large language model support,

    Y . Liu et al. , “An empirical study on low code programming us- ing traditional vs large language model support,” arXiv preprint arXiv:2402.01156, 2024

  11. [11]

    PyBugHive: A Comprehensive Database of Manually Validated, Reproducible Python Bugs,

    G. Antal et al. , “PyBugHive: A Comprehensive Database of Manually Validated, Reproducible Python Bugs,” IEEE Access, 2024

  12. [12]

    Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection,

    T. Zhang et al. , “Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection,” arXiv preprint arXiv:2503.01449, 2025

  13. [13]

    Lprotector: An llm-driven vulnerability detection system,

    Z. Sheng et al. , “Lprotector: An llm-driven vulnerability detection system,” arXiv preprint arXiv:2411.06493 , 2024

  14. [14]

    FuncVul: An Effective Function Level Vulnerabil- ity Detection Model using LLM and Code Chunk,

    S. Halder et al. , “FuncVul: An Effective Function Level Vulnerabil- ity Detection Model using LLM and Code Chunk,” arXiv preprint arXiv:2506.19453, 2025

  15. [15]

    Do LLMs consider security? an empirical study on responses to programming questions,

    A. Sajadi et al. , “Do LLMs consider security? an empirical study on responses to programming questions,” Empirical Software Engineering, vol. 30, no. 3, p. 101, 2025

  16. [16]

    RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

    I. Bouzenia et al. , “Repairagent: An autonomous, llm-based agent for program repair,” arXiv preprint arXiv:2403.17134 , 2024

  17. [17]

    How far can we go with practical function-level program repair?

    J. Xiang et al., “How far can we go with practical function-level program repair?” arXiv preprint arXiv:2404.12833 , 2024

  18. [18]

    From vulnerabilities to remediation: A systematic literature review of LLMs in code security, 2024

    E. Basic and A. Giaretta, “Large Language Models and Code Security: A Systematic Literature Review,” arXiv preprint arXiv:2412.15004, 2024

  19. [19]

    Zhang, C

    Q. Zhang et al., “A systematic literature review on large language models for automated program repair,” arXiv preprint arXiv:2405.01466, 2024

  20. [20]

    Software engineering economics,

    B. W. Boehm, “Software engineering economics,” in Software pioneers: Contributions to software engineering . Springer, 2011, pp. 641–686

  21. [21]

    B. W. Boehm et al. , Software cost estimation with COCOMO II . Prentice Hall Press, 2009