Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.
arXiv.https://arxiv.org/abs/ 2510.11822
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 2polarities
background 2representative citing papers
Empirical study of a production multi-turn ordering agent finds LLM-as-judge recall below 25% for human-confirmed defects, missing cross-turn state issues due to limited rubric and routing.
The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
citing papers explorer
-
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.
-
Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents
Empirical study of a production multi-turn ordering agent finds LLM-as-judge recall below 25% for human-confirmed defects, missing cross-turn state issues due to limited rubric and routing.
-
A Finite-Calibration Regime Map for LLM Judge Panels
The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.
-
PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.