
arxiv: 2605.19193 · v1 · pith:ZL43OG6Snew · submitted 2026-05-18 · 💻 cs.LG

Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

Pith reviewed 2026-05-20 11:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords sequential probability ratio test · LLM debate · compute efficiency · multi-agent systems · consensus scoring · calibration · stopping rule

The pith

An adaptation of Wald's Sequential Probability Ratio Test serves as a compute governor for multi-agent LLM debates by stopping early when an LLM judge detects sufficient consensus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to replace fixed round counts in LLM debates with a sequential stopping rule driven by evidence accumulated from consensus scores. After each round an LLM judge scores agreement on a [0,1] scale; a monitor accumulates a log-likelihood ratio under a Beta model and halts when the ratio crosses the boundary favoring useful convergence or the boundary favoring its opposite. Under independence assumptions the rule inherits the classical error-rate bounds, and a small calibration set checks whether the judge scores actually distinguish productive debate from stalled progress. A reader cares because fixed-round debate over-spends on easy questions and under-spends on hard ones, while on GSM8K the sequential rule cuts LLM calls 3.7x at a 2 pp accuracy cost.

Core claim

The central claim is that Wald's SPRT, applied to the successive [0,1] consensus scores that an LLM judge emits after each debate round and modeled with a Beta likelihood, yields a stopping rule that halts as soon as either boundary is crossed and otherwise returns a capped best-effort outcome at a maximum round count. Under i.i.d. score assumptions the rule inherits the type-I and type-II error guarantees of the classical test. In deployment, calibration on disjoint data is the more important object, because it quantifies how well the judge separates the two regimes in the target domain.

What carries the argument

Wald's Sequential Probability Ratio Test (SPRT) adapted as a compute governor: it accumulates the log-likelihood ratio of 'useful convergence' versus 'not yet useful' under a Beta family for the observed consensus scores and crosses an upper boundary to accept convergence or a lower boundary to reject it.
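
As a rough illustration of the monitor described above, the following Python sketch accumulates the Beta log-likelihood ratio round by round and applies Wald's standard approximate thresholds. The function name `wald_monitor`, the Beta parameters for the two hypotheses, and the clipping constant `eps` are assumptions chosen for illustration; the paper's calibrated values are not given in this summary.

```python
import math

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x)

def wald_monitor(scores, h1=(8.0, 2.0), h0=(2.0, 2.0),
                 alpha=0.05, beta=0.05, r_max=8, eps=1e-6):
    """Sequential test on judge scores (hypothetical parameters).

    h1: Beta parameters assumed for "useful convergence".
    h0: Beta parameters assumed for "not yet useful".
    Returns (outcome, rounds_used, final_llr).
    """
    upper = math.log((1.0 - beta) / alpha)   # cross this: accept convergence
    lower = math.log(beta / (1.0 - alpha))   # cross this: reject convergence
    llr = 0.0
    rounds = 0
    for rounds, s in enumerate(scores[:r_max], start=1):
        s = min(max(s, eps), 1.0 - eps)      # keep log() finite at 0 and 1
        llr += beta_logpdf(s, *h1) - beta_logpdf(s, *h0)
        if llr >= upper:
            return ("consensus", rounds, llr)
        if llr <= lower:
            return ("no_consensus", rounds, llr)
    # No boundary crossed within r_max rounds (or the scores ran out).
    return ("capped", rounds, llr)

if __name__ == "__main__":
    print(wald_monitor([0.95, 0.95]))
    print(wald_monitor([0.30]))
    print(wald_monitor([0.66] * 20))
```

With these illustrative parameters a single score cannot cross the upper boundary, so the earliest accept happens at round two. A sharper pair of calibrated distributions would allow one-round stops like those reported for GSM8K.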

If this is right

  • On GSM8K items the rule averages 1.01 rounds and 4.06 LLM calls while reaching 97% accuracy, compared with 5 rounds and 15 calls at 99% for fixed debate.
  • The method returns a capped best-effort outcome on items where the ratio never crosses either boundary within R_max rounds.
  • Calibration on disjoint subsets estimates the separation quality of the judge's consensus score for the given domain and model set.
  • Under the i.i.d. model the procedure controls type-I and type-II error rates at user-chosen levels.
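
The last bullet can be checked numerically in the idealized setting. The sketch below draws i.i.d. Beta scores under each hypothesis and counts how often the rule decides wrongly or caps. It is not the paper's Monte-Carlo code: the Beta parameters, the function name `simulate`, and the sample size are assumptions, and the result says nothing about serially dependent judge scores.

```python
import math
import random

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))

def simulate(true_params, h1=(8.0, 2.0), h0=(2.0, 2.0),
             alpha=0.05, beta=0.05, r_max=8, n=20000, seed=0):
    """Run n i.i.d. trajectories drawn from Beta(*true_params).

    Returns (outcome_rates, average_rounds). All parameters are
    hypothetical stand-ins for calibrated values.
    """
    rng = random.Random(seed)
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    counts = {"consensus": 0, "no_consensus": 0, "capped": 0}
    total_rounds = 0
    for _ in range(n):
        llr, outcome, used = 0.0, "capped", r_max
        for r in range(1, r_max + 1):
            x = rng.betavariate(*true_params)
            x = min(max(x, 1e-9), 1.0 - 1e-9)  # guard the logs
            llr += beta_logpdf(x, *h1) - beta_logpdf(x, *h0)
            if llr >= upper:
                outcome, used = "consensus", r
                break
            if llr <= lower:
                outcome, used = "no_consensus", r
                break
        counts[outcome] += 1
        total_rounds += used
    rates = {k: v / n for k, v in counts.items()}
    return rates, total_rounds / n

if __name__ == "__main__":
    # Scores drawn from the "not yet useful" model: false accepts are type-I errors.
    print("under H0:", simulate((2.0, 2.0)))
    # Scores drawn from the "useful convergence" model: false rejects are type-II errors.
    print("under H1:", simulate((8.0, 2.0)))
```

Under the i.i.d. model the false-accept rate is bounded by α/(1−β) and the false-reject rate by β/(1−α); truncation at R_max moves some trajectories into the capped bucket rather than into either error count.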

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same governor could be attached to other multi-step LLM workflows that produce intermediate quality signals.
  • When calibration indicates the judge score carries no information the rule simply caps on nearly every item, which is a conservative default.
  • Classical sequential analysis supplies lightweight failure detection without requiring new training or fine-tuning of the agents or judge.
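
One way to make "the judge score carries no information" concrete: if the Beta distributions fitted to the two calibration groups nearly coincide, the KL divergence between them is close to zero, and so is the expected log-likelihood-ratio increment per round, which means the boundaries are rarely reached. The sketch below fits each group by the method of moments and integrates the KL divergence numerically. The fitting method, the function names, and the example scores are assumptions for illustration; the summary does not say how the paper fits its calibration.

```python
import math
from statistics import mean, variance

def fit_beta_moments(scores):
    """Method-of-moments Beta fit (an assumed, simple estimator)."""
    m, v = mean(scores), variance(scores)
    common = m * (1.0 - m) / v - 1.0
    if common <= 0.0:
        raise ValueError("sample variance too large for a Beta fit")
    return m * common, (1.0 - m) * common

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))

def kl_beta(p, q, n=5000):
    """KL(Beta(*p) || Beta(*q)) by the midpoint rule on (0, 1).

    Adequate when all shape parameters exceed 1; endpoint
    singularities would need a more careful quadrature.
    """
    total = 0.0
    for i in range(n):
        x = (i + 0.5) / n
        lp = beta_logpdf(x, *p)
        total += math.exp(lp) * (lp - beta_logpdf(x, *q))
    return total / n

if __name__ == "__main__":
    # Hypothetical calibration scores for the two groups.
    useful = [0.70, 0.80, 0.90, 0.85, 0.75]
    not_useful = [0.40, 0.55, 0.60, 0.45, 0.50]
    p1, p0 = fit_beta_moments(useful), fit_beta_moments(not_useful)
    print("fitted:", p1, p0)
    print("KL(useful || not useful):", kl_beta(p1, p0))
```

A near-zero value on the calibration split is a signal to expect capping on most items rather than early stops.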

Load-bearing premise

The assumption that successive consensus scores are independent and identically distributed so that the accumulated log-likelihood ratio behaves as in the classical SPRT, plus the assumption that calibration on a small disjoint set can determine whether those scores separate useful from unhelpful debate progress.

What would settle it

Run the debate on a large set of items where the judge's consensus scores show no statistical separation between early-stop and full-round outcomes; if the rule then either stops too early on incorrect answers or caps on almost everything, the calibration-based separation claim is falsified.

Figures

Figures reproduced from arXiv: 2605.19193 by Andrea Morandi.

Figure 1: Schematic of the sequential-consensus orchestrator. After every debate …
Figure 2: (a) Average debate rounds at α = β = 0.05, R_max = 8 versus the fixed-5 baseline. (b) Per-task SPRT outcome breakdown (consensus / no-consensus / capped). SPRT machinery, working curves, error rates, and decision-type behaviour under the calibrated Beta likelihood model from Section V-C. The figures and tables in Sections VI-A–VI-E derive from N = 50,000 Monte-Carlo trajectories per hypothesis at α = β = 0…
Figure 3: Sequential-rule working curves: (a) average rounds versus …
Figure 4: (a) Sensitivity to mis-calibration: avg-rounds (green) and classification …
Figure 5: Real-LLM evaluation on 200 attempted MMLU (multiple-choice) …
Original abstract

Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real-LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a claude-opus-4-6 judge, using disjoint 40-item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed-5 debate at 15 calls: a 3.7x call reduction at -2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute-control and failure-detection layer for multi-agent LLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes adapting Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for multi-agent LLM debates. After each round an LLM judge emits a [0,1] consensus score; a Wald monitor accumulates the log-likelihood ratio of useful vs. not-yet-useful convergence under a Beta likelihood family and stops when a boundary is crossed or a cap at R_max is reached. The paper claims that under i.i.d. assumptions the procedure inherits classical SPRT type-I and type-II error guarantees. It reports Monte-Carlo simulations under calibrated Beta models together with real-LLM experiments on 200 MMLU and 200 GSM8K items using three heterogeneous agents and a disjoint 40-item calibration set, showing a 3.7x reduction in LLM calls on GSM8K at a 2 pp accuracy cost and frequent capping on MMLU.

Significance. If the empirical behaviour generalises, the work supplies a lightweight, statistically motivated layer for dynamic compute allocation in multi-agent LLM systems. The explicit separation of calibration from deployment and the reporting of both Monte-Carlo working curves and real-LLM outcomes are positive features. The central practical takeaway—that a classical sequential test can serve as a cheap failure-detection and cost-control mechanism—is potentially useful for production LLM pipelines.

major comments (1)
  1. [Abstract / method] Abstract and method description: the claim that the rule 'inherits SPRT type-I/type-II error guarantees' under i.i.d. assumptions is load-bearing for the theoretical contribution. Each successive consensus score is generated after a round that conditions on all prior agent positions, inducing serial dependence that violates the independent-increments property required for the classical SPRT bounds. The Monte-Carlo study draws i.i.d. Beta variates and therefore cannot diagnose this issue; no autocorrelation diagnostics or adjusted error-rate estimates are supplied for the 200-item GSM8K/MMLU runs.
minor comments (2)
  1. [Abstract] Abstract: numerical results (1.01 rounds, 97.0% accuracy, 4.06 calls) are reported without error bars or standard deviations, and the exact SPRT boundary values and Beta parameters used in the real-LLM track are not stated.
  2. [Evaluation] Evaluation: the manuscript should clarify the precise definition of 'useful convergence' that underlies the Beta likelihood and how the 40-item calibration subsets are used to set the decision boundaries.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and precise identification of the independence assumption. We address the major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract / method] Abstract and method description: the claim that the rule 'inherits SPRT type-I/type-II error guarantees' under i.i.d. assumptions is load-bearing for the theoretical contribution. Each successive consensus score is generated after a round that conditions on all prior agent positions, inducing serial dependence that violates the independent-increments property required for the classical SPRT bounds. The Monte-Carlo study draws i.i.d. Beta variates and therefore cannot diagnose this issue; no autocorrelation diagnostics or adjusted error-rate estimates are supplied for the 200-item GSM8K/MMLU runs.

    Authors: We agree that successive consensus scores are serially dependent, as each round's agent outputs condition on prior positions and the judge evaluates the evolving debate state. The manuscript qualifies the SPRT inheritance claim with the explicit phrase 'under i.i.d. assumptions,' framing it as an idealized theoretical reference rather than a strict guarantee for the deployed system. The Monte-Carlo experiments are intentionally conducted under the i.i.d. Beta model to isolate the behavior of the Wald boundaries and calibration procedure; they are not intended to simulate the full dependence structure. In the real-LLM experiments the calibration set directly estimates the separation between 'useful' and 'not-yet-useful' scores from observed data, which remains valid even under dependence. To strengthen the presentation we will (i) revise the abstract and method sections to state that the classical bounds are approximate under serial dependence, (ii) add lag-1 autocorrelation coefficients and partial autocorrelation plots for the judge scores on the GSM8K and MMLU runs, and (iii) report empirical type-I and type-II error proxies obtained by treating the fixed-round-5 outcome as ground truth. These additions will be included in the revised manuscript.

    revision: yes
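
A lag-1 autocorrelation diagnostic of the kind discussed in this exchange could look roughly like the following. Because each debate yields only a handful of scores, this sketch pools consecutive score pairs across all trajectories and computes a single Pearson correlation. The pooling choice and the function name `pooled_lag1_autocorr` are assumptions, not the authors' stated procedure.

```python
import math

def pooled_lag1_autocorr(trajectories):
    """Pearson correlation between s_t and s_{t+1}, pooled over debates.

    trajectories: list of per-item score lists, one score per round.
    Items with a single round contribute no pairs.
    """
    pairs = [(t[i], t[i + 1])
             for t in trajectories
             for i in range(len(t) - 1)]
    if len(pairs) < 2:
        raise ValueError("need at least two consecutive-score pairs")
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return float("nan")  # constant scores: correlation undefined
    return cov / math.sqrt(vx * vy)

if __name__ == "__main__":
    # Hypothetical judge-score trajectories for three items.
    demo = [[0.40, 0.55, 0.70, 0.80], [0.90], [0.30, 0.35, 0.50]]
    print(pooled_lag1_autocorr(demo))
```

Items that stop after one round supply no pairs, so on a run averaging close to one round per item the estimate would rest on very few items and should be read with that in mind.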

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard SPRT under explicit assumptions with external calibration


The paper adapts Wald's classical SPRT as a stopping rule for LLM debate rounds, modeling consensus scores with a Beta likelihood and stopping on log-likelihood ratio boundaries. It explicitly states that type-I/II error guarantees hold only under i.i.d. assumptions on the scores and emphasizes that calibration on disjoint 40-item subsets is the key practical component for estimating whether scores separate useful vs. unhelpful convergence. The Monte-Carlo track simulates directly from the calibrated Beta model, while the real-LLM track (200 MMLU + 200 GSM8K items) applies the rule after calibration on held-out data and reports empirical accuracy and cost. No equation or claim reduces the inherited guarantees, the stopping behavior, or the reported performance metrics to a fitted parameter by construction; the i.i.d. assumption is presented as a modeling choice rather than derived from the target data, and all performance numbers come from direct measurement on separate items.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard mathematical properties of SPRT plus domain assumptions about LLM consensus scores; calibration introduces fitted elements whose independence from the test data is only partially addressed by the disjoint-subset design.

free parameters (2)
  • SPRT decision boundaries
    Thresholds that trigger early stopping or capping are chosen or calibrated and directly determine the operating point reported in the GSM8K and MMLU results.
  • Beta likelihood parameters
    Parameters of the Beta family used to model the [0,1] consensus scores are part of the likelihood ratio computation and are calibrated on the 40-item subsets.
axioms (1)
  • domain assumption: successive consensus scores are i.i.d. across rounds
    Invoked explicitly to inherit SPRT type-I and type-II error guarantees.
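
For orientation, the two free parameters and the axiom combine as follows in the textbook SPRT. The notation here ($\Lambda_n$, $A$, $B$, $(a_1, b_1)$, $(a_0, b_0)$) is assumed for illustration, and this summary does not confirm that the paper uses exactly Wald's approximate thresholds.

```latex
% Accumulated log-likelihood ratio after n rounds, with judge scores s_t
% and Beta densities f(\cdot\,; a, b) for the two hypotheses:
\Lambda_n = \sum_{t=1}^{n} \log \frac{f(s_t; a_1, b_1)}{f(s_t; a_0, b_0)},
\qquad
f(s; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, s^{a-1} (1-s)^{b-1}.

% Wald's approximate thresholds for target error rates \alpha, \beta:
A = \log \frac{1-\beta}{\alpha}, \qquad B = \log \frac{\beta}{1-\alpha}.

% Stop and accept convergence when \Lambda_n \ge A, stop and reject when
% \Lambda_n \le B, otherwise continue until R_{\max}.
% Under i.i.d. scores the realized error rates \alpha', \beta' satisfy
\alpha' \le \frac{\alpha}{1-\beta}, \qquad \beta' \le \frac{\beta}{1-\alpha}.

% At \alpha = \beta = 0.05 (the setting quoted in the Figure 2 caption):
A = \log 19 \approx 2.944, \qquad B = -\log 19 \approx -2.944.
```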



Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” in Proceedings of the 41st International Conference on Machine Learning (ICML), 2024, arXiv:2305.14325

  2. [2]

    Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu, “Encouraging divergent thinking in large language models through multi-agent debate,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, arXiv:2305.19118

  3. [3]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in International Conference on Learning Representations (ICLR), 2023, arXiv:2203.11171

  4. [4]

    Sequential Analysis

    A. Wald, Sequential Analysis. John Wiley and Sons, 1947

  5. [5]

    Discrete sequential boundaries for clinical trials

    K. K. G. Lan and D. L. DeMets, “Discrete sequential boundaries for clinical trials,” Biometrika, vol. 70, no. 3, pp. 659–663, 1983

  6. [6]

    Peeking at A/B tests: Why it matters, and what to do about it

    R. Johari, P. Koomen, L. Pekelis, and D. Walsh, “Peeking at A/B tests: Why it matters, and what to do about it,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017

  7. [7]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” in Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2023, arXiv:2306.05685

  8. [8]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in International Conference on Learning Representations (ICLR), 2021, arXiv:2009.03300

  9. [9]

    Training verifiers to solve math word problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021

  10. [10]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2022, arXiv:2201.11903

  11. [11]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2305.10601

  12. [12]

    A. Morandi, “RTLC – research, teach-to-learn, critique: A three-stage prompting paradigm inspired by the Feynman learning technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning,” 2026. [Online]. Available: https://arxiv.org/abs/2605.13695