AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3
The pith
AnyPoC generates executable proof-of-concept tests to automatically validate and confirm bugs reported by LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnyPoC is a multi-agent system that first fact-checks a candidate bug report, then iteratively builds and runs a PoC while logging execution traces, and finally re-executes the PoC under independent scrutiny to guard against hallucination or reward hacking. It also maintains an evolving knowledge base of successful PoCs. When applied to reports from a simple agentic reporter, the system produces 1.3 times more valid PoCs on true bugs and rejects 9.8 times more false reports than state-of-the-art coding agents, resulting in 122 newly discovered bugs of which 105 have been confirmed.
What carries the argument
The three-stage multi-agent pipeline of fact-checking, trace-guided iterative PoC synthesis, and independent re-execution, augmented by an accumulating PoC knowledge base.
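The control flow of the three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `validate_report`, `PoCAttempt`, `Verdict`, and the `fact_check`, `synthesize`, and `execute` callables are hypothetical names, and each callable stands in for what the paper describes as an LLM agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PoCAttempt:
    script: str      # candidate PoC (script, command sequence, or crafted input)
    trace: str       # execution trace captured while running it
    triggered: bool  # whether the run exhibited the suspected defect


@dataclass
class Verdict:
    accepted: bool
    reason: str
    poc: Optional[str] = None


def validate_report(
    report: str,
    fact_check: Callable[[str], bool],                    # hypothetical stage-1 agent
    synthesize: Callable[[str, List[PoCAttempt]], str],   # hypothetical stage-2 agent
    execute: Callable[[str], PoCAttempt],                 # hypothetical sandboxed runner
    knowledge_base: List[str],
    max_rounds: int = 3,
) -> Verdict:
    # Stage 1: reject reports whose claims fail fact-checking.
    if not fact_check(report):
        return Verdict(False, "fact-check failed")

    # Stage 2: iterate, feeding earlier attempts and traces back into synthesis.
    history: List[PoCAttempt] = []
    candidate: Optional[str] = None
    for _ in range(max_rounds):
        script = synthesize(report, history)
        attempt = execute(script)
        history.append(attempt)
        if attempt.triggered:
            candidate = script
            break
    if candidate is None:
        return Verdict(False, "no PoC triggered the defect")

    # Stage 3: re-run the PoC and trust only this fresh result,
    # not the trace reported during synthesis.
    recheck = execute(candidate)
    if not recheck.triggered:
        return Verdict(False, "re-execution did not reproduce")

    # Successful PoCs accumulate for reuse on later reports.
    knowledge_base.append(candidate)
    return Verdict(True, "reproduced on re-execution", candidate)
```

The point of the sketch is the ordering: acceptance depends on the stage-3 run alone, so a synthesis agent that fabricates a trace in stage 2 cannot by itself produce an accepted PoC.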
If this is right
- LLM-based bug detectors can be chained directly into confirmed findings without human triage.
- Generated PoCs can serve as ready-made regression tests, as already happened for 45 cases.
- The same validation layer can be attached to any existing or future bug reporter.
- Continuous growth of the PoC knowledge base should improve success rates on new projects over time.
Where Pith is reading between the lines
- The same staged validation pattern might transfer to domains outside software, such as hardware design or formal specification errors.
- If the knowledge base is made project-agnostic, it could reduce the need for per-system tuning.
- Pairing AnyPoC with stronger base models would likely increase the absolute number of confirmed bugs, but the relative gain over baselines should be measured separately.
Load-bearing premise
The combination of fact-checking, execution-trace feedback, and independent re-execution is sufficient to block LLM hallucination and reward-hacking on bug reports from many different languages and projects.
What would settle it
Apply AnyPoC to a fresh collection of known false-positive bug reports from one of the evaluated systems and measure whether it still rejects the large majority of them.
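The measurement proposed above reduces to a single rate. A minimal sketch, assuming a hypothetical `validator` callable that returns `True` when it accepts a report and a list of reports already known to be false positives:

```python
from typing import Callable, Iterable


def rejection_rate(
    false_reports: Iterable[str],
    validator: Callable[[str], bool],  # hypothetical: True means "accepted"
) -> float:
    """Fraction of known-false reports that the validator rejects."""
    reports = list(false_reports)
    if not reports:
        raise ValueError("need at least one report")
    rejected = sum(1 for report in reports if not validator(report))
    return rejected / len(reports)
```

For the check to be informative, the false-positive labels must come from a source other than the validator being measured, such as developer triage.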
Original abstract
While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AnyPoC, a multi-agent framework for synthesizing executable proof-of-concept (PoC) tests from LLM-generated candidate bug reports. It combines fact-checking of reports, iterative PoC synthesis using execution traces, and independent re-execution to reduce hallucination and reward-hacking, while maintaining an evolving PoC knowledge base. The framework is claimed to be general across bug report sources and is evaluated on 12 large real-world systems (e.g., Firefox, Chromium, LLVM) using a simple agentic bug reporter, outperforming baselines like Claude Code and Codex by producing 1.3x more valid PoCs on true positives and rejecting 9.8x more false positives. It reports discovering 122 new bugs (105 confirmed, 86 fixed) with 45 PoCs adopted as regression tests.
Significance. If the evaluation criteria prove reproducible and non-circular, AnyPoC could meaningfully advance scalable, end-to-end automated bug detection by turning static LLM reports into validated, executable evidence. The scale of the evaluation on production systems with millions of lines of code, the concrete bug-discovery count, and the adoption of generated PoCs as official tests are concrete strengths that would support practical impact in software engineering. The cross-language generality and knowledge-base evolution are also positive features that distinguish it from narrower test-generation approaches.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: The headline claims of 1.3x more valid PoCs for true-positive reports and 9.8x more rejections of false-positive reports rest on unstated criteria for (a) what constitutes a 'valid PoC' and (b) how true/false-positive labels are assigned independently of AnyPoC's own fact-checker. Because the bug reporter is itself an LLM agent, any dependence on AnyPoC's acceptance for labeling would introduce circularity that inflates both multipliers; the manuscript provides no quantitative breakdown of ground-truth sources, inter-rater agreement, or cases where re-execution passed but developers later invalidated the PoC.
- [Framework and Evaluation] Framework (steps 1-3) and Evaluation: The description of fact-checking, iterative synthesis with traces, and independent re-execution is high-level and does not include failure-mode analysis or metrics demonstrating that the combination reliably prevents hallucination and reward-hacking across the 12 heterogeneous systems and languages. Without such evidence or ablation results, it is unclear whether new failure modes are introduced, which directly affects the soundness of the reported gains and the 122-bug discovery claim.
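The inter-rater agreement mentioned in the first comment is commonly reported as Cohen's kappa. The sketch below shows how it could be computed if two independent raters labeled the same reports; the function name and the label values are illustrative assumptions, and the manuscript does not say that such double labeling was performed.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```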
minor comments (2)
- [Abstract] The abstract and introduction would benefit from an explicit statement of the exact baseline agents, the precise definition of 'valid PoC' used in the 1.3x metric, and the statistical significance tests applied to the reported multipliers.
- [Evaluation] A table or figure presenting the per-system breakdown of PoC validity rates, false-positive rejection rates, and bug-discovery counts would improve clarity and allow readers to assess consistency across domains.
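On the significance tests requested in the first minor comment: when two tools are run on the same set of reports, the outcomes are paired, and an exact McNemar test on the discordant pairs is one standard choice. The sketch below is an illustration under that assumption. The paper does not state which test, if any, was applied, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(tool_a: Sequence[bool], tool_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired success/failure outcomes."""
    if len(tool_a) != len(tool_b):
        raise ValueError("both tools must be evaluated on the same reports")
    # Only discordant pairs (exactly one tool succeeded) carry information.
    only_a = sum(a and not b for a, b in zip(tool_a, tool_b))
    only_b = sum(b and not a for a, b in zip(tool_a, tool_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Under the null hypothesis, discordant pairs split Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Each element of `tool_a` and `tool_b` records whether that tool produced a valid PoC for the same report, so the two sequences must be in the same report order.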
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below with clarifications and commit to revisions that will improve the transparency and rigor of the evaluation and framework description.
- Referee: [Abstract and Evaluation] Abstract and Evaluation section: The headline claims of 1.3x more valid PoCs for true-positive reports and 9.8x more rejections of false-positive reports rest on unstated criteria for (a) what constitutes a 'valid PoC' and (b) how true/false-positive labels are assigned independently of AnyPoC's own fact-checker. Because the bug reporter is itself an LLM agent, any dependence on AnyPoC's acceptance for labeling would introduce circularity that inflates both multipliers; the manuscript provides no quantitative breakdown of ground-truth sources, inter-rater agreement, or cases where re-execution passed but developers later invalidated the PoC.
Authors: We appreciate the concern about potential circularity. Valid PoCs are defined as executable artifacts that reproduce the reported bug behavior via successful execution and trace verification. True/false-positive labels for input reports were assigned using external sources including project bug trackers and developer confirmations (105 of the 122 reported bugs were developer-confirmed). We will revise the Evaluation section to explicitly define 'valid PoC', provide a quantitative breakdown of ground-truth sources, and note any post-validation discrepancies. Since labeling relied on objective execution outcomes and external confirmations rather than subjective multi-rater assessment, inter-rater agreement statistics were not computed; we will add this as an explicit limitation.
revision: yes
- Referee: [Framework and Evaluation] Framework (steps 1-3) and Evaluation: The description of fact-checking, iterative synthesis with traces, and independent re-execution is high-level and does not include failure-mode analysis or metrics demonstrating that the combination reliably prevents hallucination and reward-hacking across the 12 heterogeneous systems and languages. Without such evidence or ablation results, it is unclear whether new failure modes are introduced, which directly affects the soundness of the reported gains and the 122-bug discovery claim.
Authors: We agree that more detailed evidence is needed. We will revise the Framework and Evaluation sections to include ablation studies quantifying the contribution of fact-checking, trace-guided iteration, and independent re-execution to valid PoC rates and false-positive rejection across all 12 systems. We will also add a failure-mode analysis subsection describing observed issues (e.g., incomplete traces in certain languages) and how the evolving knowledge base mitigates them. Of the 122 discovered bugs, 105 have been confirmed by developers, and the revised analysis will examine whether the component combination introduces new failure modes that would undermine the claims.
revision: yes
Circularity Check
No circularity: empirical evaluation on external projects with direct agent comparisons
The paper presents an empirical framework evaluated on 12 real-world open-source systems (Firefox, Chromium, LLVM, etc.) with direct quantitative comparisons to external baselines such as Claude Code and Codex. No equations, fitted parameters, or first-principles derivations are present in the provided text. Reported gains (1.3x valid PoCs, 9.8x false-positive rejections) and bug counts are framed as outcomes of execution-based validation on independent codebases rather than quantities defined in terms of the system's own outputs or prior self-citations. The evaluation chain does not reduce any claimed result to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents can be reliably prompted to perform fact-checking, code synthesis, and trace analysis for bug reports
Forward citations
Cited by 1 Pith paper
- Agentic Vulnerability Reasoning on Windows COM Binaries
SLYP agentic pipeline discovers race condition vulnerabilities in Windows COM binaries and generates debugger-verified PoCs, scoring 0.973 F1 on a 40-case benchmark and finding 28 new confirmed vulnerabilities in prod...