arxiv: 2507.15671 · v2 · submitted 2025-07-21 · 💻 cs.SE

BugScope: Learn to Find Bugs Like Human

Pith reviewed 2026-05-19 03:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords bug detection · LLM auditing · software auditing · code analysis · bug patterns · guideline distillation

The pith

BugScope aligns LLMs to human-style bug auditing by distilling guidelines from real reports and examples, reaching 86 percent precision and 88 percent recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to structure code auditing into three explicit steps and then guide large language models through each step with concise guidelines extracted from actual bug reports and mutated variants. This produces substantially higher accuracy than current commercial tools on a test collection of 33 real bugs drawn from 21 open-source projects. The same system then located 184 previously unknown bugs inside large production codebases, many of which developers have already addressed. A reader should care because generated code is increasing faster than human review capacity, and the work demonstrates one concrete way to make automated auditing reliable enough for practical use.

Core claim

BugScope breaks auditing into seed identification, context retrieval, and bug detection, then aligns LLMs to each step by analyzing real bug reports and mutated examples to produce concise, reusable guidelines. On a curated set of 33 real-world bugs the method records 86.05 percent precision and 87.88 percent recall; the same pipeline later surfaces 184 new bugs in projects such as the Linux kernel, of which 78 have already been fixed.

What carries the argument

Three-step auditing workflow (seed identification, context retrieval, bug detection) driven by distilled concise guidelines extracted from real bug reports and mutated examples.
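
The three-step workflow can be pictured as a small pipeline in which each step receives its own distilled guideline. The sketch below is illustrative only and is not taken from the BugScope code: the names `Guidelines`, `Finding`, `audit`, and `stub_llm` are hypothetical, and the model call is replaced by a deterministic stub so the control flow can be followed end to end.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Guidelines:
    """Hypothetical container: one distilled guideline text per step."""
    seed: str
    context: str
    detection: str


@dataclass
class Finding:
    seed: str
    context: str
    is_bug: bool


# Assumption: a "model" is any callable taking (guideline, payload) -> text.
LLM = Callable[[str, str], str]


def audit(code: str, guidelines: Guidelines, llm: LLM) -> List[Finding]:
    """Run seed identification, context retrieval, and bug detection in order."""
    # Step 1: seed identification (one candidate location per non-empty line).
    seeds = [s for s in llm(guidelines.seed, code).splitlines() if s.strip()]
    findings = []
    for seed in seeds:
        # Step 2: context retrieval for this seed only.
        context = llm(guidelines.context, seed)
        # Step 3: bug detection on the seed together with its context.
        verdict = llm(guidelines.detection, seed + "\n" + context)
        is_bug = verdict.strip().lower().startswith("bug")
        findings.append(Finding(seed, context, is_bug))
    return findings


def stub_llm(guideline: str, payload: str) -> str:
    """Deterministic stand-in for a model call, for illustration only."""
    if guideline == "SEED":
        return "\n".join(
            line.strip() for line in payload.splitlines() if "free(" in line
        )
    if guideline == "CONTEXT":
        return "pointer is dereferenced after this call"
    return "bug" if "dereferenced after" in payload else "no bug"


if __name__ == "__main__":
    snippet = "p = malloc(8);\nfree(p);\n*p = 1;"
    for finding in audit(snippet, Guidelines("SEED", "CONTEXT", "DETECT"), stub_llm):
        print(finding)
```

Keeping the guideline for each step separate mirrors the summary's claim that the model is aligned to each step individually, rather than prompted once for the whole audit.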

If this is right

  • The method records an F1 score of 0.87 on the 33-bug test set, while the industrial tools Claude Code (with Claude Opus 4.6) and Cursor BugBot reach only 0.51 and 0.43, respectively.
  • Large-scale runs on real projects such as the Linux kernel surface 184 previously unknown bugs.
  • Of the newly found bugs, 78 have already been fixed and 7 have been explicitly confirmed by project developers.
  • The distilled guidelines are intended to be reusable across different codebases and auditing sessions.
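
The reported F1 score follows from the reported precision and recall through the standard harmonic-mean formula. A minimal check, using only the figures quoted above (the function name `f1_score` is chosen here for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for the 33-bug benchmark.
    print(round(f1_score(0.8605, 0.8788), 2))  # 0.87
```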

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same example-driven guideline approach might transfer to other specialized code tasks such as security review or performance tuning.
  • Explicit workflow structuring plus guideline distillation appears to be a practical route for improving LLM reliability on narrow, high-stakes software tasks without full retraining.
  • Periodic re-distillation of guidelines from newly confirmed bugs could keep the system current as codebases and bug patterns evolve.

Load-bearing premise

Guidelines distilled from a modest collection of bug reports and mutations will continue to work on unseen bugs inside large, complex, and previously unexamined codebases.

What would settle it

Run the system on a fresh collection of 50 or more confirmed bugs from projects that were never seen during guideline creation and measure whether precision and recall both stay above 80 percent.
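
One way to make this test concrete is sketched below. It assumes, for simplicity, that every report and every confirmed bug carries a comparable identifier and that each bug is reported at most once; the paper may count matches differently. The names `precision_recall` and `passes_held_out_check` are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(reported: Set[str], confirmed: Set[str]) -> Tuple[float, float]:
    """Precision and recall of reported findings against confirmed bugs."""
    true_positives = len(reported & confirmed)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall


def passes_held_out_check(
    reported: Set[str],
    confirmed: Set[str],
    min_bugs: int = 50,
    threshold: float = 0.80,
) -> bool:
    """True when both metrics exceed the threshold on a large enough set."""
    if len(confirmed) < min_bugs:
        raise ValueError(f"need at least {min_bugs} confirmed bugs")
    precision, recall = precision_recall(reported, confirmed)
    return precision > threshold and recall > threshold


if __name__ == "__main__":
    confirmed = {f"bug-{i}" for i in range(50)}
    # Made-up data: 45 true findings plus 5 false alarms.
    reported = {f"bug-{i}" for i in range(45)} | {f"noise-{i}" for i in range(5)}
    print(precision_recall(reported, confirmed))
    print(passes_held_out_check(reported, confirmed))
```

The identifiers and counts in the usage lines are invented for the example and do not correspond to any result in the paper.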

Figures

Figures reproduced from arXiv: 2507.15671 by Chengpeng Wang, Dominic Deluca, Jinjie Liu, Jinyao Guo, Xiangyu Zhang, Zhuo Zhang.

Figure 1. The examples of anti-patterns causing various types of bugs.
Figure 2. The overview of BUGSCOPE learning from a few representative examples.
Original abstract

Software auditing is an increasingly critical task in the era of rapid code generation. While LLM-based auditors have demonstrated strong potential, their effectiveness remains limited by misalignment with the highly complex, domain-specific nature of bug detection. In this work, we introduce BugScope, a framework that mirrors how human auditors learn specific bug patterns from representative examples and apply this knowledge during code auditing. BugScope structures auditing into three steps: seed identification, context retrieval, and bug detection, and aligns LLMs to each step by analyzing real bug reports and mutated examples, and distilling concise, reusable guidelines. On a curated dataset of 33 real-world bugs from 21 widely used open-source projects, BugScope achieves 86.05% precision and 87.88% recall, corresponding to an F1 score of 0.87. By comparison, leading industrial tools such as Claude Code (with Claude Opus 4.6) and Cursor BugBot achieve F1 scores of only 0.51 and 0.43, respectively. Beyond benchmarks, large-scale evaluation on real-world projects such as the Linux kernel uncovered 184 previously unknown bugs, of which 78 have already been fixed and 7 explicitly confirmed by developers. Our code is available at https://github.com/jinyaoguo/BugScope

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. BugScope is a framework that mirrors human auditors by distilling concise, reusable guidelines from real bug reports and mutated examples. It structures LLM-based auditing into seed identification, context retrieval, and bug detection. On a curated set of 33 real-world bugs from 21 open-source projects, it reports 86.05% precision, 87.88% recall, and F1=0.87, outperforming Claude Code (F1=0.51) and Cursor BugBot (F1=0.43). Large-scale runs on the Linux kernel found 184 previously unknown bugs, of which 78 have been fixed and 7 confirmed by developers. The code is released at https://github.com/jinyaoguo/BugScope.

Significance. If the generalization holds, the work is significant for LLM-based software auditing: it provides a concrete mechanism for transferring human-derived bug patterns via guideline distillation rather than end-to-end prompting. The combination of competitive benchmark numbers on a multi-project set and externally validated findings in the Linux kernel (with developer confirmations) supplies practical evidence of utility. Explicit release of code is a clear strength that supports reproducibility.

major comments (1)
  1. [§5] §5 (Evaluation on the 33-bug benchmark and Linux kernel results): The central generalization claim—that guidelines distilled from the 33 curated bugs plus mutations transfer reliably to unseen bugs in large codebases—rests on a modest, hand-curated set without reported cross-validation across bug classes or ablation on guideline sensitivity. The Linux kernel findings are valuable external evidence, but the absence of a systematic failure-case breakdown or larger held-out test set leaves open the risk that reported precision/recall partly reflects selection effects rather than robust pattern transfer.
minor comments (2)
  1. [Abstract] Abstract: the parenthetical model reference for Claude Code should be expanded to include the exact prompt template or temperature settings used, to allow direct replication of the baseline comparison.
  2. [§3] §3 (Guideline distillation): the description of how mutated examples are generated and filtered could be expanded with a short pseudocode listing or example guideline to improve clarity for readers attempting to reproduce the process.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the practical contributions of BugScope, including the code release and the Linux kernel findings. Below we respond to the major comment on the evaluation in §5.

point-by-point responses (1)
  1. Referee: [§5] §5 (Evaluation on the 33-bug benchmark and Linux kernel results): The central generalization claim—that guidelines distilled from the 33 curated bugs plus mutations transfer reliably to unseen bugs in large codebases—rests on a modest, hand-curated set without reported cross-validation across bug classes or ablation on guideline sensitivity. The Linux kernel findings are valuable external evidence, but the absence of a systematic failure-case breakdown or larger held-out test set leaves open the risk that reported precision/recall partly reflects selection effects rather than robust pattern transfer.

    Authors: We acknowledge that the 33-bug benchmark is modest in size and hand-curated, which could raise questions about selection effects. These bugs were deliberately drawn from 21 distinct open-source projects spanning multiple domains and bug categories to promote diversity. The guideline distillation further incorporates mutated examples to broaden coverage. We agree that additional analyses would strengthen the generalization argument. In the revised manuscript we will add a cross-validation experiment (e.g., leave-one-project-out) to demonstrate transfer across projects and bug classes. We will also include an ablation study examining the contribution and sensitivity of individual guideline components. In addition, we will provide a systematic qualitative breakdown of representative failure cases observed on both the benchmark and the Linux kernel runs. While a substantially larger held-out test set would require new curation beyond the current scope, the large-scale Linux kernel application, which yielded 184 previously unknown bugs with 78 fixes and 7 explicit developer confirmations, constitutes substantial external evidence of pattern transfer to unseen production code. We will update §5 to incorporate these revisions.

    Revision: partial
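
The leave-one-project-out experiment mentioned in the response could be organized roughly as follows. This is a sketch under assumptions, not the authors' protocol: bugs are represented as hypothetical `(project, bug_id)` pairs, and guidelines would be distilled from each `train` split and evaluated on the matching `test` split.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Bug = Tuple[str, str]  # (project name, bug identifier); hypothetical layout


def leave_one_project_out(
    bugs: List[Bug],
) -> Iterator[Tuple[str, List[Bug], List[Bug]]]:
    """Yield (held_out_project, train, test) with no project shared across splits."""
    by_project: Dict[str, List[Bug]] = defaultdict(list)
    for bug in bugs:
        by_project[bug[0]].append(bug)
    for held_out in sorted(by_project):
        test = by_project[held_out]
        train = [bug for bug in bugs if bug[0] != held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    sample = [("proj-a", "1"), ("proj-a", "2"), ("proj-b", "3"), ("proj-c", "4")]
    for project, train, test in leave_one_project_out(sample):
        print(project, len(train), len(test))
```

Splitting by project rather than by individual bug keeps guidelines from being tested on code drawn from a project they were distilled from, which is the selection effect the referee raises.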

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper describes an empirical LLM-based framework that distills concise guidelines from a fixed set of 33 real-world bug reports plus mutated examples, then applies the resulting guidelines in a three-step auditing process. Reported performance numbers are obtained by direct measurement on the same curated benchmark plus independent large-scale runs on external projects such as the Linux kernel, where success is corroborated by subsequent developer fixes and explicit confirmations. No equations, fitted parameters, or model predictions are defined in terms of the target metrics; the distillation step is a one-time preprocessing step whose output is then evaluated on held-out or external data. No load-bearing self-citations or uniqueness theorems are invoked to justify the method. The central claims therefore rest on observable external outcomes rather than reducing to the input data by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that human bug-finding patterns can be captured in concise, reusable textual guidelines that LLMs can reliably follow; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: LLMs can follow concise, reusable guidelines distilled from bug reports to perform structured auditing tasks.
    Invoked when the paper states that BugScope aligns LLMs to each step by distilling guidelines.

pith-pipeline@v0.9.0 · 5771 in / 1280 out tokens · 35106 ms · 2026-05-19T03:58:13.804985+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

    cs.SE · 2026-05 · unverdicted · novelty 6.0

    Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on...

  2. Code Semantic Zooming

    cs.HC · 2025-10 · unverdicted · novelty 5.0

    CodeZoom is a pseudocode-based multi-layer abstraction tool that improves developer control and comprehension over LLM code generation compared to direct use of agents like Claude Code.
