Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection
Pith reviewed 2026-05-20 11:29 UTC · model grok-4.3
The pith
An adaptation of Wald's Sequential Probability Ratio Test serves as a compute governor for multi-agent LLM debates by stopping early when an LLM judge detects sufficient consensus.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Wald's SPRT, applied to the successive [0,1] consensus scores an LLM judge emits after each debate round and modeled with a Beta likelihood, yields a stopping rule that halts early on items where agents converge usefully and otherwise caps at a maximum round count. Under i.i.d. score assumptions the rule inherits the type-I and type-II error guarantees of the classical test. In deployment the calibration step on disjoint data matters more, because it quantifies how well the judge separates the two regimes in the target domain.
What carries the argument
Wald's Sequential Probability Ratio Test (SPRT) adapted as a compute governor: it accumulates the log-likelihood ratio of 'useful convergence' versus 'not yet useful' under a Beta family for the observed consensus scores and crosses an upper boundary to accept convergence or a lower boundary to reject it.
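The accumulate-and-compare loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the clipping constant `eps`, the default error levels, and the Beta parameters in the usage note are all assumptions made for the example.

```python
import math

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def wald_monitor(scores, h1, h0, alpha=0.05, beta=0.05, r_max=5, eps=1e-6):
    """Return (decision, rounds_used) for a sequence of judge scores.

    h1, h0: hypothetical (a, b) Beta parameters for 'useful convergence'
    and 'not yet useful'. alpha, beta: target type-I / type-II levels.
    """
    upper = math.log((1 - beta) / alpha)  # crossing accepts convergence
    lower = math.log(beta / (1 - alpha))  # crossing rejects convergence
    llr = 0.0
    rounds = 0
    for s in scores[:r_max]:
        rounds += 1
        s = min(max(s, eps), 1 - eps)  # keep log() finite at 0 and 1
        llr += beta_logpdf(s, *h1) - beta_logpdf(s, *h0)
        if llr >= upper:
            return "converged", rounds
        if llr <= lower:
            return "not_useful", rounds
    return "capped", rounds  # best-effort outcome at R_max
```

With illustrative parameters `h1=(8, 2)` and `h0=(2, 8)`, a single score of 0.9 crosses the upper boundary in one round, while a run of uninformative 0.5 scores never moves the ratio and ends in the capped outcome.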
If this is right
- On GSM8K items the rule averages 1.01 rounds and 4.06 LLM calls while reaching 97% accuracy, compared with 5 rounds and 15 calls at 99% for fixed debate.
- The method returns a capped best-effort outcome on items where the ratio never crosses either boundary within R_max rounds.
- Calibration on disjoint subsets estimates the separation quality of the judge's consensus score for the given domain and model set.
- Under the i.i.d. model the procedure controls type-I and type-II error rates at user-chosen levels.
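For reference, the classical quantities behind the last bullet can be written out. This is a sketch using Wald's standard approximate boundaries; the symbols $(a_1, b_1)$, $(a_0, b_0)$, $\alpha$, $\beta$, and $\Lambda_n$ are notation assumed here, not taken from the paper.

```latex
% s_t : judge consensus score after round t
% H_1 : "useful convergence", s_t ~ Beta(a_1, b_1)
% H_0 : "not yet useful",     s_t ~ Beta(a_0, b_0)
\begin{align*}
\Lambda_n &= \sum_{t=1}^{n} \log \frac{f(s_t; a_1, b_1)}{f(s_t; a_0, b_0)} \\
          &= \sum_{t=1}^{n} \Big[ (a_1 - a_0) \log s_t + (b_1 - b_0) \log (1 - s_t) \Big]
             + n \log \frac{B(a_0, b_0)}{B(a_1, b_1)}
\end{align*}
% Stopping rule with target type-I level \alpha and type-II level \beta:
\[
\text{accept } H_1 \text{ if } \Lambda_n \ge \log \frac{1 - \beta}{\alpha},
\qquad
\text{accept } H_0 \text{ if } \Lambda_n \le \log \frac{\beta}{1 - \alpha}.
\]
```

Under i.i.d. scores and these boundaries, Wald's inequalities bound the realized error rates by $\alpha/(1-\beta)$ and $\beta/(1-\alpha)$ respectively. Truncation at R_max adds a third, capped outcome that these bounds do not cover.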
Where Pith is reading between the lines
- The same governor could be attached to other multi-step LLM workflows that produce intermediate quality signals.
- When calibration indicates the judge score carries no information the rule simply caps on nearly every item, which is a conservative default.
- Classical sequential analysis supplies lightweight failure detection without requiring new training or fine-tuning of the agents or judge.
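One way to make "the judge score carries no information" concrete is the KL divergence between the two calibrated Beta models, which is zero when they coincide. The sketch below uses the closed-form Beta KL; which divergence and direction the paper reports is not stated in this summary, so treat the choice here, along with all function names and example parameters, as assumptions.

```python
import math

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_kl(p, q):
    """KL(Beta(*p) || Beta(*q)) in nats, closed form."""
    a1, b1 = p
    a2, b2 = q
    return (log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))
```

A value near zero means each score moves the log-likelihood ratio by almost nothing on average, so the boundaries are rarely reached and the rule caps. Well-separated models such as `(8, 2)` versus `(2, 8)` give a KL of several nats, enough for a single round to decide.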
Load-bearing premise
The assumption that successive consensus scores are independent and identically distributed so that the accumulated log-likelihood ratio behaves as in the classical SPRT, plus the assumption that calibration on a small disjoint set can determine whether those scores separate useful from unhelpful debate progress.
What would settle it
Run the debate on a large set of items where the judge's consensus scores show no statistical separation between early-stop and full-round outcomes; if the rule then either stops too early on incorrect answers or caps on almost everything, the calibration-based separation claim is falsified.
Original abstract
Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real-LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a claude-opus-4-6 judge, using disjoint 40-item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed-5 debate at 15 calls: a 3.7x call reduction at -2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute-control and failure-detection layer for multi-agent LLM systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes adapting Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for multi-agent LLM debates. After each round an LLM judge emits a [0,1] consensus score; a Wald monitor accumulates the log-likelihood ratio of useful vs. not-yet-useful convergence under a Beta likelihood family and stops when a boundary is crossed or a cap at R_max is reached. The paper claims that under i.i.d. assumptions the procedure inherits classical SPRT type-I and type-II error guarantees. It reports Monte-Carlo simulations under calibrated Beta models together with real-LLM experiments on 200 MMLU and 200 GSM8K items using three heterogeneous agents and a disjoint 40-item calibration set, showing a 3.7x reduction in LLM calls on GSM8K at a 2 pp accuracy cost and frequent capping on MMLU.
Significance. If the empirical behaviour generalises, the work supplies a lightweight, statistically motivated layer for dynamic compute allocation in multi-agent LLM systems. The explicit separation of calibration from deployment and the reporting of both Monte-Carlo working curves and real-LLM outcomes are positive features. The central practical takeaway—that a classical sequential test can serve as a cheap failure-detection and cost-control mechanism—is potentially useful for production LLM pipelines.
major comments (1)
- [Abstract / method] Abstract and method description: the claim that the rule 'inherits SPRT type-I/type-II error guarantees' under i.i.d. assumptions is load-bearing for the theoretical contribution. Each consensus score is generated after a round that conditions on all prior agent positions, which induces serial dependence and violates the independent-increments property required for the classical SPRT bounds. The Monte-Carlo study draws i.i.d. Beta variates and therefore cannot diagnose this issue; no autocorrelation diagnostics or adjusted error-rate estimates are supplied for the 200-item GSM8K/MMLU runs.
minor comments (2)
- [Abstract] Abstract: numerical results (1.01 rounds, 97.0% accuracy, 4.06 calls) are reported without error bars or standard deviations, and without the exact SPRT boundary values and Beta parameters used in the real-LLM track.
- [Evaluation] Evaluation: the manuscript should clarify the precise definition of 'useful convergence' that underlies the Beta likelihood and how the 40-item calibration subsets are used to set the decision boundaries.
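To illustrate what the last comment asks the authors to spell out, here is one common way a small calibration subset could be turned into Beta parameters: a method-of-moments fit per regime. The manuscript's actual procedure is not described in this summary, so the approach, the function name, and the sample scores below are hypothetical.

```python
import statistics

def fit_beta_moments(scores):
    """Method-of-moments Beta(a, b) fit to scores in (0, 1).

    Hypothetical calibration step: fit one Beta to scores from rounds
    labelled 'useful convergence' and another to the remaining rounds.
    """
    m = statistics.fmean(scores)
    v = statistics.variance(scores)  # sample variance, needs >= 2 points
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("scores are not compatible with a Beta fit")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Made-up calibration scores for the 'useful convergence' regime.
a1, b1 = fit_beta_moments([0.80, 0.90, 0.85, 0.95, 0.70])
```

The fitted pair reproduces the sample mean as `a / (a + b)`. With only 40 calibration items the estimates will be noisy, which is one reason the referee asks for the boundary values and parameters to be reported.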
Simulated Author's Rebuttal
We thank the referee for the thoughtful and precise identification of the independence assumption. We address the major comment below and will revise the manuscript accordingly.
- Referee: [Abstract / method] Abstract and method description: the claim that the rule 'inherits SPRT type-I/type-II error guarantees' under i.i.d. assumptions is load-bearing for the theoretical contribution. Each consensus score is generated after a round that conditions on all prior agent positions, which induces serial dependence and violates the independent-increments property required for the classical SPRT bounds. The Monte-Carlo study draws i.i.d. Beta variates and therefore cannot diagnose this issue; no autocorrelation diagnostics or adjusted error-rate estimates are supplied for the 200-item GSM8K/MMLU runs.
Authors: We agree that successive consensus scores are serially dependent, as each round's agent outputs condition on prior positions and the judge evaluates the evolving debate state. The manuscript qualifies the SPRT inheritance claim with the explicit phrase 'under i.i.d. assumptions,' framing it as an idealized theoretical reference rather than a strict guarantee for the deployed system. The Monte-Carlo experiments are intentionally conducted under the i.i.d. Beta model to isolate the behavior of the Wald boundaries and calibration procedure; they are not intended to simulate the full dependence structure. In the real-LLM experiments the calibration set directly estimates the separation between 'useful' and 'not-yet-useful' scores from observed data, which remains valid even under dependence. To strengthen the presentation we will (i) revise the abstract and method sections to state that the classical bounds are approximate under serial dependence, (ii) add lag-1 autocorrelation coefficients and partial autocorrelation plots for the judge scores on the GSM8K and MMLU runs, and (iii) report empirical type-I and type-II error proxies obtained by treating the fixed-round-5 outcome as ground truth. These additions will be included in the revised manuscript.
Revision: yes
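The lag-1 autocorrelation diagnostic promised in item (ii) could look like the sketch below. Because each debate has at most R_max scores, per-item series are very short, so this version pools adjacent-round pairs across items around a global mean. The pooling choice and the function name are assumptions, not the authors' stated method.

```python
def pooled_lag1_autocorr(sequences):
    """Lag-1 autocorrelation of judge scores, pooled over debates.

    sequences: list of per-item score lists, one score per round.
    Only adjacent rounds within the same item contribute to the numerator.
    """
    values = [x for seq in sequences for x in seq]
    if len(values) < 2:
        raise ValueError("need at least two scores")
    mean = sum(values) / len(values)
    denom = sum((x - mean) ** 2 for x in values)
    if denom == 0:
        return float("nan")  # constant scores: autocorrelation undefined
    num = sum((a - mean) * (b - mean)
              for seq in sequences
              for a, b in zip(seq, seq[1:]))
    return num / denom
```

A value far from zero on the real runs would indicate that the independent-increments assumption is materially violated and that the nominal error levels should be read as approximate.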
Circularity Check
No significant circularity; derivation uses standard SPRT under explicit assumptions with external calibration
The paper adapts Wald's classical SPRT as a stopping rule for LLM debate rounds, modeling consensus scores with a Beta likelihood and stopping on log-likelihood ratio boundaries. It explicitly states that type-I/II error guarantees hold only under i.i.d. assumptions on the scores and emphasizes that calibration on disjoint 40-item subsets is the key practical component for estimating whether scores separate useful vs. unhelpful convergence. The Monte-Carlo track simulates directly from the calibrated Beta model, while the real-LLM track (200 MMLU + 200 GSM8K items) applies the rule after calibration on held-out data and reports empirical accuracy and cost. No equation or claim reduces the inherited guarantees, the stopping behavior, or the reported performance metrics to a fitted parameter by construction; the i.i.d. assumption is presented as a modeling choice rather than derived from the target data, and all performance numbers come from direct measurement on separate items.
Axiom & Free-Parameter Ledger
free parameters (2)
- SPRT decision boundaries
- Beta likelihood parameters
axioms (1)
- domain assumption: successive consensus scores are i.i.d. across rounds