Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
Pith reviewed 2026-05-10 03:07 UTC · model grok-4.3
The pith
Adversarial multi-agent review with kill mandates filters most false positives from LLM defect reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Refute-or-Promote is an inference-time pattern that combines stratified context hunting for candidate generation, adversarial agents given explicit kill mandates at each gate, context asymmetry between reviewers, cross-model critique to catch correlated errors, and a final mandatory empirical validation step. No defect was found autonomously by the agents; the contribution lies in the external filtering structure that removed most plausible-but-wrong reports before they reached maintainers.
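The gate structure described above can be sketched as a candidate pool filtered through successive kill attempts. This is a minimal illustration, not the paper's implementation; the class and gate names are invented here for clarity.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of the stage-gate structure: a candidate is promoted
# only if every adversarial gate fails to refute it.

@dataclass
class Candidate:
    description: str
    evidence: dict = field(default_factory=dict)

def run_pipeline(candidates: List[Candidate],
                 gates: List[Callable[[Candidate], bool]]) -> List[Candidate]:
    """Promote only the candidates that survive every kill attempt."""
    survivors = []
    for cand in candidates:
        # A gate returns True when it successfully refutes the candidate.
        if any(gate(cand) for gate in gates):
            continue  # killed at some gate
        survivors.append(cand)
    return survivors

def adversarial_review(cand: Candidate) -> bool:
    # Stand-in for an adversarial agent operating under a kill mandate.
    return cand.evidence.get("refuted_by_reviewer", False)

def empirical_gate(cand: Candidate) -> bool:
    # Final mandatory gate: no reproduction, no promotion.
    return not cand.evidence.get("reproduced", False)

pool = [
    Candidate("plausible but wrong", {"refuted_by_reviewer": True}),
    Candidate("unreproducible claim", {"reproduced": False}),
    Candidate("real defect", {"reproduced": True}),
]
promoted = run_pipeline(pool, [adversarial_review, empirical_gate])
# promoted contains only the "real defect" candidate
```

The ordering of gates matters in practice (cheap refutations first, the empirical test last), but promotion requires surviving all of them regardless of order.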
What carries the argument
The Refute-or-Promote pattern of successive promotion gates at which adversarial agents attempt to disprove each LLM-generated defect candidate.
If this is right
- High elimination rates mean far fewer incorrect reports reach human maintainers.
- Surviving candidates produced four CVEs plus accepted changes to the C++ working paper and conformance bug reports against major compilers.
- The empirical gate proved necessary when all reviewers initially endorsed a non-existent bug.
- A simplified variant also resolved previously unsolved instances on SWE-bench Verified.
Where Pith is reading between the lines
- The pattern suggests that LLM systems for discovery tasks gain reliability more from external refutation structures than from internal model improvements alone.
- Automating the empirical gate with targeted test harnesses could allow the method to scale to larger candidate volumes.
- Similar staged kill-mandate review could apply to other domains where LLMs produce plausible but unverifiable outputs, such as hypothesis generation in science.
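The second point above, automating the empirical gate, can be sketched under one simplifying assumption: each candidate report ships a runnable reproduction script. The exit-code-plus-marker convention below is illustrative, not taken from the paper.

```python
import subprocess
import sys
import tempfile

# Hypothetical sketch of an automated empirical gate. Convention (invented
# here): the reproduction script asserts the claimed behaviour and prints
# REPRODUCED; any assertion failure, crash, or hang kills the candidate.

def empirical_gate(repro_source: str, timeout: int = 30) -> bool:
    """Return True iff the reproduction script demonstrates the defect."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(repro_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # a hung reproduction counts as a kill
    return result.returncode == 0 and "REPRODUCED" in result.stdout

# A fabricated claim whose reproduction fails its own assertion is killed;
# a claim backed by genuinely reproducible behaviour is promoted.
fake = "assert 0.1 + 0.1 != 0.2, 'claimed rounding defect'\nprint('REPRODUCED')"
real = "assert 0.1 + 0.2 != 0.3\nprint('REPRODUCED')"
killed = not empirical_gate(fake)   # True: the assertion fails
promoted = empirical_gate(real)     # True: the behaviour reproduces
```

A harness like this would let the mandatory gate run on every candidate rather than only on those a human chooses to test, which is the scaling step the bullet above points at.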
Load-bearing premise
Adversarial agents supplied with kill mandates and separated contexts can eliminate false-positive defect reports without also discarding genuine defects, with the empirical gate acting as the final check.
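The "separated contexts" half of this premise can be illustrated with a cold-start reviewer that never sees the upstream transcript and so cannot anchor on earlier endorsements. Everything here is a toy, assuming `review_fn` stands in for an LLM call.

```python
# Illustrative sketch of context asymmetry between reviewers.

def cold_start_review(claim, transcript, review_fn):
    # Deliberately drop the transcript: the reviewer sees the claim alone.
    return review_fn(claim)

def warm_review(claim, transcript, review_fn):
    # Anchored variant, shown only for contrast.
    return review_fn("\n".join(transcript + [claim]))

# Toy reviewer that anchors: it endorses anything whose context already
# contains an endorsement, but judges a bare claim on its own merits.
def toy_reviewer(prompt: str) -> bool:
    if "ENDORSED" in prompt:
        return True
    return "reproduced" in prompt

transcript = ["reviewer-1: ENDORSED", "reviewer-2: ENDORSED"]
claim = "padding oracle in CMS (no reproduction attached)"
anchored = warm_review(claim, transcript, toy_reviewer)           # True
independent = cold_start_review(claim, transcript, toy_reviewer)  # False
```

The contrast is the point: the anchored reviewer inherits the cascade that the paper's unanimous false-endorsement case exemplifies, while the cold-start reviewer does not.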
What would settle it
A test set containing both confirmed real defects and known fabricated reports, checking whether the pipeline ever advances a fabricated report to disclosure or discards a real defect that independent verification later confirms.
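That settling experiment amounts to measuring both error directions on a labelled pool. A minimal sketch, with invented labels and a stand-in `survives` predicate for the pipeline:

```python
# Sketch of the settling experiment: run the pipeline over a labelled pool
# mixing confirmed real defects with known fabricated reports and check
# both error directions. All names here are illustrative.

def settle(pool, survives):
    """pool: list of (candidate_id, is_real); survives: candidate_id -> bool."""
    promoted_fakes = [c for c, real in pool if not real and survives(c)]
    killed_reals = [c for c, real in pool if real and not survives(c)]
    # The load-bearing premise fails if either list is non-empty.
    return promoted_fakes, killed_reals

labeled_pool = [("real-1", True), ("fabricated-1", False),
                ("real-2", True), ("fabricated-2", False)]
# Toy pipeline that happens to kill exactly the fabricated reports:
toy_survives = lambda cid: cid.startswith("real")
fp, fn = settle(labeled_pool, toy_survives)  # ([], [])
```

The paper reports neither list for its campaign, which is precisely the referee's false-negative objection below.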
Original abstract
LLM-assisted defect discovery has a precision crisis: plausible-but-wrong reports overwhelm maintainers and degrade credibility for real findings. We present Refute-or-Promote, an inference-time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross-Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold-start reviewers are intended to reduce anchoring cascades; cross-family review can catch correlated blind spots that same-family review misses. Over a 31-day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated-protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security-related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140-3 normative compliance issues under coordinated disclosure -- all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL's CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives. As a preliminary transfer test beyond defect discovery, a simplified cross-family critique variant also solved five previously unsolved SymPy instances on SWE-bench Verified and one SWE-rebench hard task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Refute-or-Promote, an inference-time adversarial stage-gated multi-agent methodology for LLM-assisted defect discovery. It combines Stratified Context Hunting (SCH) for candidate generation, adversarial agents with kill mandates, context asymmetry to mitigate anchoring, a Cross-Model Critic (CMC), and a mandatory empirical gate. Over a 31-day campaign on 7 targets, the pipeline reports retrospective kill rates of ~79% on 171 candidates and prospective rates of 83% on a 30-candidate subset (lcms2, wolfSSL), yielding 4 CVEs, acceptance of LWG 4549 into the C++ working paper, 5 merged editorial PRs, 3 compiler bugs, 8 merged security fixes, an RFC 9000 errata, and FIPS issues—all validated via external acceptance rather than internal benchmarks. The paper highlights a failure case in which 10 reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL CMS, killed only by empirical testing, and includes a preliminary transfer test on SWE-bench Verified.
Significance. If the filtering claims hold, the work provides a practical, externally validated pattern for raising precision in LLM-driven security analysis and standards work, with demonstrated real-world impact through CVEs and accepted changes. The explicit use of external acceptance (CVEs, PR merges, working-paper acceptance) as the evaluation criterion, rather than synthetic benchmarks, is a strength, as is the detailed documentation of the unanimous false-positive endorsement that motivated the empirical gate. The preliminary SWE-bench transfer result suggests broader applicability beyond defect discovery.
major comments (2)
- [Abstract and evaluation section] The central claim that adversarial kill mandates, context asymmetry, and the mandatory empirical gate reliably eliminate persistent false positives without collateral loss of true defects is not supported by any measurement of false-negative rate. No controlled experiment on labeled data or ground-truth defect pool is reported to quantify how many true defects (if any) were discarded at any gate; only promoted items' external acceptance is shown.
- [Methodology description] The account of how cold-start reviewers and cross-family critique prevent loss of true positives lacks any formal safeguard, backtracking rule, or ablation showing survival rates for known-true candidates; without this, the assertion that the structure 'filters … persistent false positives' without discarding true defects remains unquantified.
minor comments (2)
- [Abstract] The abstract refers to a 'consolidated-protocol subset' without defining the protocol differences or selection criteria, which obscures interpretation of the 83% prospective kill rate.
- [Evaluation section] Candidate lists, per-gate kill reasons, and decision logs are not provided even in summary form, limiting reproducibility and independent assessment of the 79%/83% aggregates.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the real-world impact demonstrated through external validations such as CVEs and standards acceptances. We address the major comments below, agreeing that certain quantifications are absent and proposing partial revisions to clarify limitations while preserving the manuscript's emphasis on external evaluation criteria.
Point-by-point responses
- Referee: [Abstract and evaluation section] The central claim that adversarial kill mandates, context asymmetry, and the mandatory empirical gate reliably eliminate persistent false positives without collateral loss of true defects is not supported by any measurement of false-negative rate. No controlled experiment on labeled data or ground-truth defect pool is reported to quantify how many true defects (if any) were discarded at any gate; only promoted items' external acceptance is shown.
Authors: We agree that no controlled false-negative rate is reported, as the real-world setting provides no complete ground-truth pool of all possible defects across the targets. Evaluation instead relies on external acceptance of promoted items (4 CVEs, LWG 4549, merged PRs, etc.). The mandatory empirical gate is motivated by the documented case of unanimous reviewer endorsement of a non-existent Bleichenbacher oracle, which was killed only by testing. We do not assert zero collateral loss but note that the structure surfaced multiple externally validated defects. We will revise the evaluation section to explicitly discuss this limitation and the rationale for prioritizing external validation over synthetic FNR metrics. revision: partial
- Referee: [Methodology description] The account of how cold-start reviewers and cross-family critique prevent loss of true positives lacks any formal safeguard, backtracking rule, or ablation showing survival rates for known-true candidates; without this, the assertion that the structure 'filters … persistent false positives' without discarding true defects remains unquantified.
Authors: We acknowledge the absence of formal safeguards, backtracking rules, or ablations on known-true candidates. Cold-start reviewers and cross-family critique are intended to mitigate anchoring and correlated blind spots, but without labeled data these mechanisms cannot be quantified via survival rates. The 31-day campaign did promote multiple defects that received external acceptance. We will revise the methodology section to clarify the design intent of these elements and to note the lack of ablations as a limitation, while emphasizing that the pipeline's value is shown through real-world outcomes rather than internal benchmarks. revision: partial
Circularity Check
No circularity: outcomes rest on external acceptances independent of internal pipeline definitions
full rationale
The paper presents a multi-agent review methodology and reports aggregate kill rates plus specific external outcomes (4 CVEs, LWG 4549 acceptance, merged PRs, etc.) from a real-world campaign. These results are evaluated by third-party acceptance rather than by any internal prediction, fitted parameter, or quantity defined by the methodology itself. Nothing in the provided text reduces the reported findings to the pipeline's own constructions; the evaluation chain therefore terminates in external acceptances rather than internal benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents produce plausible-but-wrong defect reports at high rates that require external adversarial filtering to reach usable precision.