An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
Pith reviewed 2026-05-08 07:54 UTC · model grok-4.3
The pith
Locally run LLMs detect Python bugs with 43 to 45 percent accuracy on a standard benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Two locally executed models achieve 43 percent and 45 percent accuracy when prompted in zero-shot fashion to identify bugs at the function level in the BugsInPy benchmark. A large share of their non-exact answers still correctly identify the problematic code region without specifying the exact fix. Detection performance varies significantly across the 17 projects, indicating that codebase-specific traits strongly influence how well local LLMs perform bug detection.
What carries the argument
Zero-shot function-level prompting of local models combined with an automated keyword-based scoring system that labels responses as correct, partially correct, or incorrect.
If this is right
- Local models can already surface a meaningful fraction of bugs while keeping all code on the user's machine.
- Exact bug localization stays difficult, especially for bugs that depend on broader context.
- Project-specific traits dominate performance, so results from one codebase do not reliably predict results on another.
- Many responses supply useful partial information even when they fall short of a complete diagnosis.
Where Pith is reading between the lines
- Teams working under privacy or connectivity constraints could use these models for an initial automated screen before human review.
- Pairing local LLMs with static analysis tools might convert many partially correct outputs into actionable fixes.
- The observed project-to-project variation suggests that lightweight per-project adaptation or few-shot examples drawn from the same codebase could raise accuracy without moving to cloud services.
Load-bearing premise
The keyword-based automated framework correctly distinguishes fully correct, partially correct, and incorrect answers without needing human judgment or deeper semantic analysis.
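The review does not reproduce the paper's actual keyword rules, so the following is a minimal hypothetical sketch of how such a three-way keyword classifier might work. The function name `classify_response` and the parameters `buggy_line` and `fix_keywords` are illustrative assumptions, not the authors' API; the point is only to show why such a scheme is load-bearing and where it can misfire.

```python
def classify_response(response: str, buggy_line: str, fix_keywords: list[str]) -> str:
    """Label a model response as correct, partially correct, or incorrect
    by substring matching (hypothetical rules; the paper's exact rules
    are not published in the reviewed text)."""
    text = response.lower()
    # "Correct" requires naming the buggy region AND every fix keyword.
    mentions_region = buggy_line.lower() in text
    mentions_fix = all(k.lower() in text for k in fix_keywords)
    if mentions_region and mentions_fix:
        return "correct"
    # "Partially correct" flags the right region or the fix, but not both.
    if mentions_region or any(k.lower() in text for k in fix_keywords):
        return "partially correct"
    return "incorrect"
```

A scheme like this is cheap to run over 349 responses, but it treats paraphrases ("off-by-one in the return statement") as misses, which is exactly the validation gap the referee report raises below.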
What would settle it
A manual human review of the 349 model responses that reclassifies enough partially correct or incorrect answers as fully correct to push accuracy well above 45 percent would falsify the reported performance numbers.
Original abstract
Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context-dependent bugs in realistic development scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical evaluation of two locally deployed LLMs (LLaMA 3.2 and Mistral) for bug detection in Python code. It applies zero-shot prompting at the function level to 349 bugs from the BugsInPy benchmark across 17 projects and classifies outputs via an automated keyword-based framework into correct, partially correct, or incorrect. Reported results include accuracies of 43-45%, a large share of partially correct responses that identify problematic regions without exact fixes, and substantial performance variation across projects, which the authors attribute to codebase characteristics.
Significance. If the results hold, the work usefully demonstrates the feasibility of privacy-preserving, on-device LLM use for a core software engineering task and underscores that local models can surface a meaningful fraction of real bugs even if precise localization remains challenging. The scale (349 bugs, 17 projects) and use of the established BugsInPy benchmark are strengths that support comparability with prior cloud-based studies. The emphasis on project-level variation also supplies a concrete direction for future work on context-aware adaptation.
major comments (1)
- [§4.3] §4.3 (Evaluation Framework): The headline accuracy figures (43-45%) and the claim of a 'large proportion of partially correct responses' rest entirely on an automated keyword-based classifier with no reported validation against human judgments, inter-annotator agreement, error analysis, or semantic comparison. Because keyword rules may misclassify context-dependent or partially correct answers (especially in complex functions), both the quantitative results and the interpretation of model capabilities are load-bearing on an unverified metric.
minor comments (2)
- [Abstract] Abstract: The statement that models produce 'a large proportion of partially correct responses' is not accompanied by a numerical breakdown or percentage that would let readers assess the practical weight of this category.
- [§5] §5 (Results): Tables reporting per-project accuracies would benefit from confidence intervals or statistical tests to substantiate the claim of 'significant' variation across projects.
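One concrete form the statistics requested in the second minor comment could take is a Wilson score interval around each per-project accuracy. This is a sketch under the assumption that per-project detection accuracy can be modeled as a binomial proportion; the 9-of-20 figure below is a made-up example, not a number from the paper.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion,
    e.g. a per-project bug-detection accuracy."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Hypothetical project: 9 of 20 bugs detected.
lo, hi = wilson_interval(9, 20)  # roughly (0.26, 0.66)
```

With only a dozen or so bugs per project, these intervals are wide, which is precisely why nominal per-project differences need interval estimates or tests before being called 'significant'.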
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the significance of evaluating locally deployed LLMs on the BugsInPy benchmark. We address the single major comment below and will revise the manuscript to strengthen the evaluation framework.
Point-by-point responses
- Referee: [§4.3] §4.3 (Evaluation Framework): The headline accuracy figures (43-45%) and the claim of a 'large proportion of partially correct responses' rest entirely on an automated keyword-based classifier with no reported validation against human judgments, inter-annotator agreement, error analysis, or semantic comparison. Because keyword rules may misclassify context-dependent or partially correct answers (especially in complex functions), both the quantitative results and the interpretation of model capabilities are load-bearing on an unverified metric.
Authors: We agree that the absence of validation for the automated keyword-based classifier is a limitation that affects the strength of the reported results. The classifier was constructed from recurring output patterns observed during pilot runs (e.g., presence of terms such as 'bug', 'error', 'incorrect', 'fix', or 'line X' for localization), but we did not quantify its agreement with human judgment in the submitted manuscript. In the revised version we will add a dedicated validation subsection: two authors will independently label a stratified random sample of 100 model outputs (50 per model) using the same three-way taxonomy, report Cohen's kappa for inter-annotator agreement, and compute agreement between the human labels and the automated classifier. We will also include a short error analysis highlighting any systematic misclassifications, particularly on context-dependent or multi-line bugs. These additions will directly support the 43-45% accuracy figures and the interpretation of partially correct responses.
Revision: yes
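The agreement statistic promised in the rebuttal is straightforward to compute. A minimal sketch of Cohen's kappa for two annotators labeling the same outputs with the three-way taxonomy (illustrative code, not the authors' tooling):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators over the same items:
    observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label alike.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

On the proposed 100-output sample, kappa would show whether the correct / partially correct / incorrect taxonomy is reliable enough for the automated classifier to be validated against it at all.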
Circularity Check
No circularity: purely empirical evaluation with no derivations or self-referential reductions
Full rationale
The paper conducts an empirical study by executing two LLMs on the external BugsInPy benchmark, applying zero-shot prompts, and classifying outputs via a keyword-based framework. No equations, parameter fitting, predictions derived from fits, or load-bearing self-citations appear in the provided text or abstract. Results are direct experimental observations rather than any chain that reduces by construction to its own inputs, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] The BugsInPy benchmark is a valid representation of real-world Python bugs for evaluation purposes
Reference graph
Works this paper leans on
- [1] J. Achiam et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023. Available: https://arxiv.org/abs/2303.08774
- [2] Anthropic, “Claude 3 Model Card,” Technical Report, 2024. Available: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
- [3] Meta AI, “LLaMA 3.2 Model Card,” Technical Report, 2024. Available: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/
- [4] A. Jiang et al., “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023. Available: https://arxiv.org/abs/2310.06825
- [5] TIOBE Index, “TIOBE Programming Community Index,” 2024. Available: https://www.tiobe.com/tiobe-index/
- [6] R. Widyasari et al., “BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies,” in Proc. of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, pp. 1556–1560, doi:10.1145/3368089.3417943
- [7] M. Chen, J. Tworek, H. Jun, et al., “Evaluating Large Language Models Trained on Code,” arXiv preprint arXiv:2107.03374, 2021
- [8] Z. Feng, D. Guo, D. Tang, et al., “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” in Proc. EMNLP, 2020, pp. 1536–1547
- [9] S. Kang, G. An, and S. Yoo, “A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization,” arXiv preprint arXiv:2308.05487, 2023
- [10] A. Mhatre, N. Nader, P. Diehl, and D. Gupta, “LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python,” arXiv preprint arXiv:2508.16419, 2025
- [11] E. G. Santana, M. Rehman, and F. Perez, “Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells,” arXiv preprint arXiv:2506.07594, 2025
- [12] J. Acharya and G. Ginde, “Enhancing Bug Report Quality Using LLMs,” arXiv preprint arXiv:2504.18804, 2025
- [13] H. Lee, S. Sharma, and B. Hu, “Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks,” arXiv preprint arXiv:2406.15325, 2024
- [14] F. Tambon, L. Smith, and J. Kim, “Bugs in Large Language Models Generated Code: An Empirical Study,” arXiv preprint arXiv:2403.08937, 2024
- [15] F. Aguilar, S. Grayson, and D. Marinov, “Reproducing and Improving the BugsInPy Dataset,” in Proc. IEEE SCAM, Bogotá, Colombia, 2023, pp. 260–264, doi:10.1109/SCAM59687.2023.00036
- [16] C. Pushkar, S. Kabra, D. Kumar, and J. Challa, “Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection,” arXiv preprint arXiv:2512.22306, 2025
- [17] Q. Zhang, C. Fang, Y. Xie, et al., “A Systematic Literature Review on Large Language Models for Automated Program Repair,” arXiv preprint arXiv:2405.01466, 2025
- [18] H. Touvron, T. Lavril, G. Izacard, et al., “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint arXiv:2302.13971, 2023
- [19] Ollama, “Ollama,” 2023. Available: https://ollama.com
- [20] Q. McNemar, “Note on the sampling error of the difference between correlated proportions,” Psychometrika, vol. 12, no. 2, pp. 153–157, 1947, doi:10.1007/BF02295996