pith. machine review for the scientific record.

arXiv: 2604.23361 · v1 · submitted 2026-04-25 · 💻 cs.SE · cs.AI · cs.LG

Recognition: unknown

An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 07:54 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords local LLMs · bug detection · Python · BugsInPy · LLaMA 3.2 · Mistral · zero-shot prompting · software engineering

The pith

Locally run LLMs detect Python bugs with 43 to 45 percent accuracy on a standard benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates whether large language models that run entirely on local hardware can locate bugs in real Python projects without sending code to external services. The authors test LLaMA 3.2 and Mistral on 349 bugs across 17 projects from the BugsInPy collection by prompting each model once per function to describe any issues it finds. The models reach overall accuracy in the low-to-mid 40s percent range and frequently produce responses that correctly flag the general area of a bug even when they do not name the precise correction. Accuracy differs sharply from one project to another, showing that characteristics of the surrounding code matter more than raw model size for this task.
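
To make the setup concrete, here is a minimal sketch of one zero-shot, function-level query against a locally served model. Ollama appears in the paper's reference list, but the exact serving stack and prompt wording are not given in the abstract, so both the endpoint and the template below are assumptions rather than the authors' pipeline.

```python
# Minimal sketch: one zero-shot prompt per function against a local model.
# Assumes an Ollama server at its default port; the prompt text is illustrative.
import requests

PROMPT_TEMPLATE = (
    "You are a code reviewer. Inspect the following Python function and "
    "describe any bug you find, including the line it occurs on:\n\n{code}"
)

def detect_bug(function_source: str, model: str = "llama3.2") -> str:
    """Send a single zero-shot prompt for one function and return the raw reply."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": PROMPT_TEMPLATE.format(code=function_source),
            "stream": False,  # one complete JSON object instead of a token stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

buggy = "def mean(xs):\n    return sum(xs) / len(xs) - 1\n"  # off-by-one bug
print(detect_bug(buggy))
```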

Core claim

Two locally executed models achieve 43 percent and 45 percent accuracy when prompted in zero-shot fashion to identify bugs at the function level in the BugsInPy benchmark. A large share of their non-exact answers still correctly identify the problematic code region without specifying the exact fix. Detection performance varies significantly across the 17 projects, indicating that codebase-specific traits strongly influence how well local LLMs perform bug detection.

What carries the argument

Zero-shot function-level prompting of local models combined with an automated keyword-based scoring system that labels responses as correct, partially correct, or incorrect.
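
The simulated rebuttal below notes that the classifier was built from recurring output patterns (terms such as 'bug', 'error', 'incorrect', 'fix', and 'line X'). Here is a hedged sketch of such a three-way keyword scorer; the specific cue lists and decision rules are assumptions, not the authors' published framework.

```python
# Hedged sketch of a keyword-based scorer in the spirit of the paper's
# framework; the cue words and decision rules here are assumed, not published.
import re

REGION_CUES = ("bug", "error", "incorrect", "fix")

def score_response(response: str, buggy_line: int, fix_keywords: list[str]) -> str:
    """Label a model response as 'correct', 'partial', or 'wrong'."""
    text = response.lower()
    # Did the response name the actual correction (e.g. the token to remove)?
    mentions_fix = any(kw.lower() in text for kw in fix_keywords)
    # Did it point at the buggy line explicitly?
    mentions_line = bool(re.search(rf"\bline\s+{buggy_line}\b", text))
    # Did it at least flag the general problem region?
    flags_region = any(cue in text for cue in REGION_CUES)
    if mentions_fix and (mentions_line or flags_region):
        return "correct"
    if mentions_line or flags_region:
        return "partial"
    return "wrong"

reply = "There is a bug on line 2: the trailing '- 1' is wrong; remove it."
print(score_response(reply, buggy_line=2, fix_keywords=["- 1", "remove"]))  # correct
```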

If this is right

  • Local models can already surface a meaningful fraction of bugs while keeping all code on the user's machine.
  • Exact bug localization stays difficult, especially for bugs that depend on broader context.
  • Project-specific traits dominate performance, so results from one codebase do not reliably predict results on another.
  • Many responses supply useful partial information even when they fall short of a complete diagnosis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams working under privacy or connectivity constraints could use these models for an initial automated screen before human review.
  • Pairing local LLMs with static analysis tools might convert many partially correct outputs into actionable fixes (a sketch of this pairing follows the list).
  • The observed project-to-project variation suggests that lightweight per-project adaptation or few-shot examples drawn from the same codebase could raise accuracy without moving to cloud services.
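
As a rough illustration of the second point above, the sketch below cross-checks LLM-flagged line numbers against pyflakes warnings. Both the pairing and the parsing of pyflakes output are editorial assumptions, not a pipeline described in the paper.

```python
# Sketch: confirm an LLM-flagged line with pyflakes before acting on it.
# Assumes the LLM reply has already been parsed into candidate line numbers.
import io

from pyflakes.api import check
from pyflakes.reporter import Reporter

def static_confirms(source: str, llm_lines: set[int]) -> bool:
    """Return True if pyflakes flags any line the LLM also pointed at."""
    out = io.StringIO()
    check(source, "<llm-candidate>", Reporter(out, out))
    flagged = set()
    for line in out.getvalue().splitlines():
        # pyflakes lines look like "<llm-candidate>:2:12: undefined name 'x'"
        parts = line.split(":")
        if len(parts) >= 2 and parts[1].isdigit():
            flagged.add(int(parts[1]))
    return bool(flagged & llm_lines)

src = "def f():\n    return x + 1\n"  # 'x' is undefined on line 2
print(static_confirms(src, llm_lines={2}))  # True: both tools agree on line 2
```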

Load-bearing premise

The keyword-based automated framework correctly distinguishes fully correct, partially correct, and incorrect answers without needing human judgment or deeper semantic analysis.

What would settle it

A manual review of all 349 model responses that shifted enough labels, in either direction, to move overall accuracy well outside the reported 43 to 45 percent band would falsify the headline numbers.

Figures

Figures reproduced from arXiv: 2604.23361 by Jelena Ilić Vulićević.

Figure 1
Figure 1. Overall bug detection results by model. The caption snippet also excerpts §G of the paper (Bug Type Classification), which manually assigns each of the 349 bugs to one of nine categories: Null/None Check, Return Value, Conditional Logic, Indexing, Error Handling, Loop Logic, Type Conversion, Comparison Operator, and Other/Complex. view at source ↗
Figure 2
Figure 2. Bug detection accuracy by project. The caption snippet also excerpts Table IV, score distribution by project in percent (rows may not sum to exactly 100 due to rounding; the excerpt cuts off at the tornado row):

Project | LLaMA 3.2 Correct / Partial / Wrong | Mistral Correct / Partial / Wrong
PySnooper | 100.0 / 0.0 / 0.0 | 100.0 / 0.0 / 0.0
black | 73.7 / 26.3 / 0.0 | 68.4 / 31.6 / 0.0
fastapi | 69.2 / 30.8 / 0.0 | 61.5 / 38.5 / 0.0
ansible | 61.5 / 38.5 / 0.0 | 69.2 / 30.8 / 0.0
keras | 54.8 / 38.7 / 6.5 | 48.4 / 48.4 / 3.2
tornado | 50.0 / 50.0 / 0.0 | 60.0 / 20.0 / 2… (truncated)

view at source ↗
Figure 3
Figure 3. Score distribution by project for LLaMA 3.2. view at source ↗
Figure 5
Figure 5. Bug detection accuracy by bug type for both models. view at source ↗
read the original abstract

Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context dependent bugs in realistic development scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents an empirical evaluation of two locally deployed LLMs (LLaMA 3.2 and Mistral) for bug detection in Python code. It applies zero-shot prompting at the function level to 349 bugs from the BugsInPy benchmark across 17 projects and classifies outputs via an automated keyword-based framework into correct, partially correct, or incorrect. Reported results include accuracies of 43-45%, a large share of partially correct responses that identify problematic regions without exact fixes, and substantial performance variation across projects, which the authors attribute to codebase characteristics.

Significance. If the results hold, the work usefully demonstrates the feasibility of privacy-preserving, on-device LLM use for a core software engineering task and underscores that local models can surface a meaningful fraction of real bugs even if precise localization remains challenging. The scale (349 bugs, 17 projects) and use of the established BugsInPy benchmark are strengths that support comparability with prior cloud-based studies. The emphasis on project-level variation also supplies a concrete direction for future work on context-aware adaptation.

major comments (1)
  1. [§4.3] Evaluation Framework: The headline accuracy figures (43-45%) and the claim of a 'large proportion of partially correct responses' rest entirely on an automated keyword-based classifier with no reported validation against human judgments, inter-annotator agreement, error analysis, or semantic comparison. Because keyword rules may misclassify context-dependent or partially correct answers (especially in complex functions), both the quantitative results and the interpretation of model capabilities are load-bearing on an unverified metric.
minor comments (2)
  1. [Abstract] The statement that models produce 'a large proportion of partially correct responses' is not accompanied by a numerical breakdown or percentage that would let readers assess the practical weight of this category.
  2. [§5] Results: Tables reporting per-project accuracies would benefit from confidence intervals or statistical tests to substantiate the claim of 'significant' variation across projects (a sketch of such intervals follows this list).
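
For the second minor comment, Wilson score intervals are one standard choice, shown below via statsmodels. The per-project counts are back-calculated from the percentages quoted in the Figure 2 caption and should be read as illustrative reconstructions, not the paper's raw data.

```python
# Sketch: 95% Wilson confidence intervals for per-project accuracy.
# Counts are illustrative reconstructions from the quoted percentages.
from statsmodels.stats.proportion import proportion_confint

projects = {  # project: (correct, total), assumed totals
    "fastapi": (9, 13),
    "keras": (17, 31),
    "tornado": (8, 16),
}
for name, (correct, total) in projects.items():
    lo, hi = proportion_confint(correct, total, alpha=0.05, method="wilson")
    print(f"{name}: {correct / total:.1%} accuracy, 95% CI [{lo:.1%}, {hi:.1%}]")
```

On samples this small the intervals come out wide, which is the referee's point: a 13-bug project cannot support strong per-project claims on its own.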

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the significance of evaluating locally deployed LLMs on the BugsInPy benchmark. We address the single major comment below and will revise the manuscript to strengthen the evaluation framework.

read point-by-point responses
  1. Referee: [§4.3] Evaluation Framework: The headline accuracy figures (43-45%) and the claim of a 'large proportion of partially correct responses' rest entirely on an automated keyword-based classifier with no reported validation against human judgments, inter-annotator agreement, error analysis, or semantic comparison. Because keyword rules may misclassify context-dependent or partially correct answers (especially in complex functions), both the quantitative results and the interpretation of model capabilities are load-bearing on an unverified metric.

    Authors: We agree that the absence of validation for the automated keyword-based classifier is a limitation that affects the strength of the reported results. The classifier was constructed from recurring output patterns observed during pilot runs (e.g., presence of terms such as 'bug', 'error', 'incorrect', 'fix', or 'line X' for localization), but we did not quantify its agreement with human judgment in the submitted manuscript. In the revised version we will add a dedicated validation subsection: two authors will independently label a stratified random sample of 100 model outputs (50 per model) using the same three-way taxonomy, report Cohen's kappa for inter-annotator agreement, and compute agreement between the human labels and the automated classifier. We will also include a short error analysis highlighting any systematic misclassifications, particularly on context-dependent or multi-line bugs. These additions will directly support the 43-45% accuracy figures and the interpretation of partially correct responses. revision: yes
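
A minimal sketch of the agreement computation the rebuttal proposes, using scikit-learn's cohen_kappa_score; the label arrays are illustrative stand-ins for the planned 100-output sample, not data from the study.

```python
# Sketch: inter-annotator agreement plus classifier-vs-human agreement.
# Labels below are illustrative placeholders for the proposed validation sample.
from sklearn.metrics import cohen_kappa_score

human_a = ["correct", "partial", "wrong", "partial", "correct"]
human_b = ["correct", "partial", "wrong", "correct", "correct"]
auto    = ["correct", "partial", "partial", "partial", "correct"]

kappa = cohen_kappa_score(human_a, human_b)  # agreement between the two humans
auto_match = sum(a == h for a, h in zip(auto, human_a)) / len(auto)
print(f"Cohen's kappa (human vs human): {kappa:.2f}")
print(f"Automated classifier agreement with annotator A: {auto_match:.0%}")
```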

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential reductions

full rationale

The paper conducts an empirical study by executing two LLMs on the external BugsInPy benchmark, applying zero-shot prompts, and classifying outputs via a keyword-based framework. No equations, parameter fitting, predictions derived from fits, or load-bearing self-citations appear in the provided text or abstract. Results are direct experimental observations rather than any chain that reduces by construction to its own inputs, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study; beyond the single domain assumption listed below, it introduces no new free parameters or invented entities.

axioms (1)
  • domain assumption: The BugsInPy benchmark is a valid representation of real-world Python bugs for evaluation purposes
    The study relies on this benchmark for all 349 bugs without additional validation mentioned in the abstract.

pith-pipeline@v0.9.0 · 5488 in / 1313 out tokens · 39941 ms · 2026-05-08T07:54:42.258666+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023. Available: https://arxiv.org/abs/2303.08774

  2. [2]

    Claude 3 Model Card

    Anthropic, “Claude 3 Model Card,” Technical Report, 2024. Available: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  3. [3]

    LLaMA 3.2 Model Card

    Meta AI, “LLaMA 3.2 Model Card,” Technical Report, 2024. Available: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/

  4. [4]

    Mistral 7B

    A. Jiang et al., “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023. Available: https://arxiv.org/abs/2310.06825

  5. [5]

    TIOBE Programming Community Index

    TIOBE Index, “TIOBE Programming Community Index,” 2024. Available: https://www.tiobe.com/tiobe-index/

  6. [6]

    BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies

    R. Widyasari et al., “BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies,” in Proc. ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, pp. 1556–1560, doi: 10.1145/3368089.3417943

  7. [7]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, et al., “Evaluating Large Language Models Trained on Code,” arXiv preprint arXiv:2107.03374, 2021

  8. [8]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Z. Feng, D. Guo, D. Tang, et al., “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” in Proc. EMNLP, 2020, pp. 1536–1547

  9. [9]

    A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization

    S. Kang, G. An, and S. Yoo, “A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization,” arXiv preprint arXiv:2308.05487, 2023

  10. [10]

    Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++

    A. Mhatre, N. Nader, P. Diehl, and D. Gupta, “LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python,” arXiv preprint arXiv:2508.16419, 2025

  11. [11]

    Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells

    E. G. Santana, M. Rehman, and F. Perez, “Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells,” arXiv preprint arXiv:2506.07594, 2025

  12. [12]

    Enhancing Bug Report Quality Using LLMs

    J. Acharya and G. Ginde, “Enhancing Bug Report Quality Using LLMs,” arXiv preprint arXiv:2504.18804, 2025

  13. [13]

    Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks

    H. Lee, S. Sharma, and B. Hu, “Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks,” arXiv preprint arXiv:2406.15325, 2024

  14. [14]

    Bugs in Large Language Models Generated Code: An Empirical Study

    F. Tambon, L. Smith, and J. Kim, “Bugs in Large Language Models Generated Code: An Empirical Study,” arXiv preprint arXiv:2403.08937, 2024

  15. [15]

    Reproducing and Improving the BugsInPy Dataset

    F. Aguilar, S. Grayson, and D. Marinov, “Reproducing and Improving the BugsInPy Dataset,” in Proc. IEEE SCAM, Bogotá, Colombia, 2023, pp. 260–264, doi: 10.1109/SCAM59687.2023.00036

  16. [16]

    Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

    C. Pushkar, S. Kabra, D. Kumar, and J. Challa, “Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection,” arXiv preprint arXiv:2512.22306, 2025

  17. [17]

    A Systematic Literature Review on Large Language Models for Automated Program Repair

    Q. Zhang, C. Fang, Y. Xie, et al., “A Systematic Literature Review on Large Language Models for Automated Program Repair,” arXiv preprint arXiv:2405.01466, 2025

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, et al., “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint arXiv:2302.13971, 2023

  19. [19]

    Ollama

    Ollama, “Ollama,” 2023. Available: https://ollama.com

  20. [20]

    Note on the Sampling Error of the Difference Between Correlated Proportions

    Q. McNemar, “Note on the sampling error of the difference between correlated proportions,” Psychometrika, vol. 12, no. 2, pp. 153–157, 1947, doi: 10.1007/BF02295996