pith. machine review for the scientific record.

arxiv: 2605.04236 · v2 · submitted 2026-05-05 · 💻 cs.LG

Recognition: no theorem link

Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:42 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM ensembles · adaptive stopping · consensus · routing signals · evidence accumulation · verbalized confidence · deliberative stopping

The pith

The DASE stopping rule partitions LLM ensemble outputs into commit types that separate high-accuracy from low-accuracy cases, generalize across benchmarks, and complement verbalized confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DASE, a stopping heuristic for iterative LLM ensembles that commits to an answer when genuine consensus forms among the models and falls back to global frequency when evidence stays fragmented. This produces a commit-type routing partition that separates high-accuracy from low-accuracy cases. On GPQA-Extended the partition yields a 39.5 percentage point gap between right-wall and left-wall commits; on AIME the gap is 25.5 points and matches the separation obtained from verbalized single-call confidence while disagreeing on 37 percent of assignments. Experiments show that the adaptive stopping decision itself, not the density of evidence injected each round, drives the accuracy gains. The same injection-based ensembles exhibit an inverted-U accuracy trajectory as deliberation steps increase.

Core claim

DASE is a stopping heuristic for iterative LLM ensembles that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. The resulting commit-type partition generalizes across benchmarks and is complementary to verbalized single-call confidence. On GPQA-Extended with a 70B ensemble the partition produces a 39.5 pp routing gap (81.1 percent right-wall versus 41.5 percent left-wall). On AIME with a 120B ensemble the gap is 25.5 pp, statistically equivalent to the gap from verbalized confidence at matched coverage while disagreeing on 37 percent of routing assignments. Adaptive stopping, not injection bandwidth, accounts for the accuracy improvement, and injection-based ensembles exhibit an inverted-U accuracy trajectory as deliberation steps increase.

What carries the argument

DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic that commits early on genuine consensus and falls back to global frequency on fragmented evidence to produce a calibrated commit-type routing partition.
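The paper's actual rule is a spatial accumulator with collapsing thresholds (Figure 1); the commit/fallback logic described above can be sketched roughly as follows. This is a hypothetical simplification: the function name `dase_stop`, the step sizes, and the unanimity/fragmentation conditions are illustrative, not the paper's exact parameterization.

```python
from collections import Counter

def dase_stop(sample_worker, n_workers=5, wall=8, step=2, max_rounds=50):
    """Sketch of a DASE-style stopping rule (hypothetical simplification).

    Each round, every worker proposes an answer. Agreement pushes an
    accumulator toward the right wall (+wall); fragmentation pushes it
    toward the left wall (-wall). Right-wall contact fires a consensus
    commit; left-wall contact triggers the global-frequency fallback.
    """
    x = 0                # accumulator position in [-wall, +wall]
    history = Counter()  # global answer frequencies across all rounds
    for _ in range(max_rounds):
        answers = [sample_worker() for _ in range(n_workers)]
        history.update(answers)
        top, count = Counter(answers).most_common(1)[0]
        if count == n_workers:         # unanimous round: strong evidence
            x += step
        elif count <= n_workers // 2:  # fragmented round: no plurality forming
            x -= step
        if x >= wall:                  # consensus commit
            return top, "right-wall"
        if x <= -wall:                 # global-frequency fallback
            return history.most_common(1)[0][0], "left-wall"
    # budget exhausted without wall contact: fall back to global frequency
    return history.most_common(1)[0][0], "no-wall"
```

A worker pool that keeps agreeing drives the accumulator to the right wall within a few rounds; persistently fragmented answers drift to the left wall and the global mode is returned instead.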

If this is right

  • The commit-type partition can be used to route high-commit cases to fast inference and low-commit cases to additional deliberation or stronger models.
  • Sparse evidence injection of roughly 15 tokens per worker per round suffices to maintain large routing gaps.
  • The 37 percent disagreement with verbalized confidence allows hybrid routing that combines both signals.
  • Accuracy peaks at intermediate numbers of deliberation rounds before declining, implying an optimal stopping budget per task.
  • The routing gaps remain stable when ensemble size and model scale change from 70B to 120B.
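Since the two signals disagree on 37 percent of assignments, a hybrid router is the natural consequence of the bullets above. A minimal sketch, assuming both signals are available per item; the function `route`, the label strings, and the 0.97 default (echoing the paper's coverage-matched conf≥97 threshold, rescaled to [0, 1]) are all hypothetical:

```python
def route(commit_type, verbalized_conf, conf_threshold=0.97):
    """Hypothetical hybrid router: trust an item if EITHER the DASE
    commit type is a right-wall consensus OR the single-call verbalized
    confidence clears the threshold; escalate everything else to more
    deliberation or a stronger model."""
    trusted = (commit_type == "right-wall") or (verbalized_conf >= conf_threshold)
    return "fast-path" if trusted else "escalate"

# Items where either signal is confident take the fast path;
# only items both signals distrust are escalated.
assert route("right-wall", 0.50) == "fast-path"
assert route("left-wall", 0.99) == "fast-path"
assert route("left-wall", 0.60) == "escalate"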

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining DASE commit types with verbalized confidence could produce a routing system that covers more cases than either signal alone.
  • The inverted-U accuracy curve suggests that per-task learning of an optimal deliberation budget could further improve results.
  • Testing the same partition on open-weight models would show whether the routing signal depends on proprietary training details.
  • On harder problems where consensus forms more slowly the routing gap may widen, offering a natural way to allocate extra compute.

Load-bearing premise

The observed accuracy gaps between commit types reflect genuine differences in correctness rather than benchmark-specific artifacts or unstated selection effects in the ensemble runs.

What would settle it

Re-running the method on an independent benchmark where right-wall commit accuracy falls to or below left-wall accuracy would falsify the generalization of the routing gap.
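That falsification test is cheap to state as code. A minimal sketch, with assumed field names, that computes the right-wall minus left-wall gap from a fresh benchmark run; a result at or below zero would falsify the claimed generalization:

```python
def routing_gap(results):
    """Given (commit_type, correct) pairs from an independent benchmark
    run, return right-wall accuracy minus left-wall accuracy in
    percentage points. (Sketch; tuple layout and labels assumed.)"""
    def acc(wall):
        hits = [ok for w, ok in results if w == wall]
        return 100.0 * sum(hits) / len(hits)
    return acc("right-wall") - acc("left-wall")

# Toy run: 80% right-wall accuracy vs. 40% left-wall accuracy.
runs = ([("right-wall", True)] * 8 + [("right-wall", False)] * 2
        + [("left-wall", True)] * 4 + [("left-wall", False)] * 6)
assert routing_gap(runs) == 40.0
```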

Figures

Figures reproduced from arXiv: 2605.04236 by Roberto Medina.

Figure 1. DASE-Spatial arena trajectories (W=8, mixed ensemble, pilot N=100). Right-wall contact (x = +8) fires a consensus commit, used directly. Left-wall contact (x = −8) triggers the global-frequency fallback; all problems that do not reach the right wall are flagged for human review. Case A: the ensemble corrects an early wrong plurality and reaches right-wall consensus (truth: 158). Case B: reasoning inertia (truth: 71). Case C: maximum oscillation (truth: 125). Case D: change of mind (truth: 15). Value functions: VR = g_t − c_x(W − x_t) and VL = (1 − g_t) − c_x(x_t + W), where c_x = 0.01.

Figure 2. Compute-matched comparison at the hesitation-region peak.

Figure 3. Accuracy vs. inferences, GPQA (N=546). All injection-based baselines peak near 30 inferences and decay thereafter (retrospective observation); DASE-Spatial sits at or above all baselines at every compute budget. Significance vs. the W=8 reference (McNemar + BH-FDR): SC70 (ns); Debate-Dense R12 (*); Debate-Sparse R12 (**); BoN-V@57 (***); DASE W=4 (**). Full accuracy-vs-inference curves are in Appendix F.

Figure 4. Compute-matched comparison, AIME-300 (N=300). DASE-Spatial W=8 (65.0%) and W=4 (59.3%) are both shown. SC70: 61.7% (ns); Debate-Dense R12: 59.3% (*); Debate-Sparse R12: 59.0% (**); BoN-V@57: 43.0% (***). Injection bandwidth effect: +0.3 pp (ns); adaptive stopping effect: +6.0 pp.

Figure 4. Accuracy by commit type, GPQA (N=546, 70B, 95% bootstrap CI). Grey bars: S1 consensus. W=4: 39.5 pp gap; W=8: 45.8 pp gap. The routing partition produces even larger accuracy gaps on GPQA than on AIME.

Figure 5. DASE-Spatial (W=2) vs. Claude Opus 4.6 Standard, AIME 2010–2023 (N=261, contamination-controlled, 3 seeds). (a) Running accuracy: DASE 84.5%, Opus 84.5%, McNemar p=1.000 (ns; 95% CI ≈ ±5 pp). (b) Commit signal (primary): right-wall 94.7% [92.4%, 97.0%]; left-wall 75.3% [71.6%, 79.0%]; 19.3 pp gap. (c) DASE wins 54 (21%), Opus wins 42 (16%), 165 equal (63%).

Figure 6. Latency analysis, AIME-300 (120B ensemble vs. single-call).

Figure 6. (a) Routing-gap equivalence: DASE 25.5 pp vs. Opus 25.7 pp. (b) Per-problem complementarity: 37% disagree (McNemar p=1.000). AIME 2010–2023, N=261, bias-corrected, 3 seeds. Threshold sweep in Appendix E.

Figure 7. Latency, AIME-300 (70B mixed ensemble, N=300). Right-wall commits are 25% faster than Debate-Sparse; fallbacks are 31% slower.

Figure 8. SC (Qwen3-80B-A3B only) vs. DASE Neuro. Pilot analysis, AIME (N=100): parameters were calibrated on a separate ≈30-problem corpus; held-out N=98: DASE-Spatial 86.7%, DASE Neuro 84.7%. The mixed ensemble (3×Qwen3-80B + 2×Llama3-70B) surpasses the SC70 asymptote (80.0%) at ∼14 inferences with the heuristic and ∼58 with Spatial.

Figure 9. SC and IM-SC vs. DASE (mixed ensemble, N=100). DASE Heuristic (k=2): 84.0%; DASE-Spatial (W=8): 86.0%; held-out N=98: 86.7%.

Figure 10. DASE-Spatial reasoning dynamics (W=8, mixed ensemble, N=100).

Figure 11. DASE (W=2) vs. Opus 4.6, AIME 2010–2023 (N=261, 3 seeds): running accuracy, commit-type partition, per-problem scatter. Recency control: year labels were sourced from publicly available AIME corpus metadata; a positional keep-mask was applied uniformly to all DASE and Opus seeds, retaining N=261 problems spanning 2010–2023 (excluding 9 AIME 2024 problems and 30 AIME 2026 problems).

Figure 12. DASE-Spatial (W=2) vs. Opus 4.6 Standard, AIME 2010–2026 (N=300; comparisons potentially inflated by differential 2026 exposure). DASE 85.0%, Opus 82.4%, McNemar p=0.115 (ns). Right-wall 95.2%, left-wall 75.5%, gap 19.7 pp.

Figure 12. Effect of the g=0.0 correction, AIME 2010–2023 (N=261, 3 seeds). Right-wall: 94.7%→98.3%; overall accuracy unchanged.

Figure 13. Accuracy vs. inferences, AIME-300 (N=300). Debate-Dense and Debate-Sparse are nearly indistinguishable (+0.3 pp bandwidth effect, ns): the full 6.0 pp Debate-to-DASE gap is attributable to adaptive stopping alone. SC rises monotonically; all injection-based baselines decay (retrospective observation).

Figure 13. Bias correction, 70B ensemble (3×Qwen3-80B + 2×Llama3-70B), AIME-300 (N=300, W=4 and W=8). Threshold sweep (Appendix E): Opus routing value across all confidence thresholds on N=254; the coverage-matched threshold conf≥97 (52.0% coverage) was chosen to match DASE right-wall coverage; the gap peaks at conf≥88 (36.5 pp) and narrows at extreme thresholds.

Figure 14. Full GPQA-Extended ablation (N=546, McNemar + BH-FDR). Both W=4 and W=8 achieve 70.0% (statistically equivalent).

Figure 15. Full AIME-300 ablation (N=300, McNemar + BH-FDR). W=8 significantly outperforms W=4 (adj p=0.0042).

Figure 16. Arena-size ablation (70B, N=100, W=8 reference). W=2 falls significantly below W=8 (adj p=0.0234); W=4 is equivalent on the pilot corpus (adj p=0.69) but significantly below on AIME-300 (adj p=0.0042).

Figure 17. Component ablation (N=100). Consensus S1 (67.0%), DASE 3-workers (75.0%), and DASE Neuro (84.0%) demonstrate that sequential evidence accumulation contributes more than raw worker count; the flat-threshold variant (83.0%) is statistically equivalent to the dynamic threshold (ns).

Figure 18. Mixed vs. homogeneous ensemble, DASE Neuro. The mixed heuristic (84.0%) outperforms the Qwen3-80B-A3B-only ensemble (77.0%). Llama 3.3-70B's contribution is adversarial dissent: its responses prevent premature consensus, enabling error correction.

Figure 19. Injection ablation, DASE Neuro. Spatial Mix with injection: 86.0% (**, adj p=0.0029); Mix without injection: 79.0% (ns); Qwen with injection: 75.0% (ref); Qwen without injection: 70.0% (ns).

Figure 20. Worker quality: injected vs. independent. Injected workers show lower conformity bias.

Figure 21. Worker quality and injection dynamics (mixed ensemble).

Figure 22. k-ablation (AIME N=100). All k≥2 plateau; k=2 at ≈14 inferences is recommended.

Figure 23. Round-matched IM-SC control (N=100). Information-Matched Self-Consistency (IM-SC) injects a frozen round-1 candidate pool into subsequent workers with no multi-round updating and no early stopping. DASE-Spatial gains 9 problems (p=0.0039); DASE Neuro gains 26 (p<0.0001); zero regressions in both cases.

Figure 24. AIME, 5×Llama 3.1-8B (N=100). IM-SC plateaus at 4.0%, well below DASE-Spatial's 9.0%. Structural motivation (Appendix H): the DASE-Spatial stopping rule borrows three elements from the POMDP solution to the embodied 2AFC task [Drugowitsch et al., 2012, Medina, 2019]: dual terminal boundaries, a collapsing threshold, and a hesitation region. Worker independence is violated from round 2, and the belief-transition distribution is intractable.

Figure 25. Parameter sensitivity sweep, AIME-300 (N=300, 70B, 180 configurations).
read the original abstract

Large Language Model ensembles improve reasoning accuracy, but only up to a performance boundary beyond which additional deliberation degrades accuracy. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. We make three contributions. (1) DASE produces a commit-type routing partition that generalises across benchmarks and is complementary to verbalized single-call confidence. On GPQA-Extended (N=546, 70B ensemble), the partition yields a 39.5 pp routing gap (right-wall 81.1% vs. left-wall 41.5%). On AIME 2010–2023 (N=261, 120B ensemble, 3 seeds), right-wall commits reach 98.3% accuracy vs. left-wall 72.8% (25.5 pp gap), statistically equivalent to Opus 4.6 Standard verbalized confidence at matched coverage (25.7 pp gap; bootstrap p=0.873); the two mechanisms disagree on 37% of routing assignments. (2) Adaptive stopping, not injection bandwidth, drives accuracy. On AIME-300, bandwidth accounts for only 0.3 pp (ns). On GPQA-Extended at the 120B tier, sparse injection (≈15 tokens/worker/round) achieves 70.9% with a 30.7 pp routing gap; dense injection (≈600 chars/worker/round) achieves 72.2% but with halved right-wall coverage and a narrower 18.9 pp gap. (3) Injection-based methods exhibit an inverted-U accuracy-vs-inference trajectory; this pattern is hypothesis-generating.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative LLM ensemble deliberation. It commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. The central claims are that DASE produces a commit-type routing partition that generalizes across benchmarks and is complementary to verbalized single-call confidence, yielding large accuracy gaps (39.5 pp on GPQA-Extended with right-wall 81.1% vs. left-wall 41.5%; 25.5 pp on AIME with 98.3% vs. 72.8%). Adaptive stopping, not injection bandwidth, drives accuracy gains, with supporting ablations showing negligible bandwidth effects and an inverted-U accuracy-vs-inference trajectory for injection methods.

Significance. If the reported routing gaps reflect genuine differences in consensus quality rather than selection effects, the work offers a practical, low-overhead mechanism for calibrated early commitment in LLM ensembles. The empirical scale (N=546 on GPQA-Extended, N=261 on AIME across seeds), direct comparisons to verbalized confidence (37% disagreement), and bandwidth controls provide concrete evidence that could inform efficient reasoning pipelines. The hypothesis-generating observation on inverted-U trajectories also opens avenues for further study in compute-aware deliberation.

major comments (3)
  1. [GPQA-Extended and AIME results (abstract)] The 39.5 pp and 25.5 pp accuracy gaps between right-wall and left-wall commits are presented without stratification by independent difficulty proxies, difficulty-matched subset analysis, or comparison to a non-adaptive baseline using identical total compute. This leaves open the possibility that early-commit partitions simply capture easier items, confounding attribution to the adaptive stopping rule.
  2. [AIME experiments (abstract)] The claim of statistical equivalence to Opus 4.6 Standard verbalized confidence (bootstrap p=0.873 at matched coverage) is load-bearing for the complementarity argument, yet the abstract provides no details on how coverage was exactly matched or on the bootstrap resampling procedure.
  3. [Bandwidth ablation (AIME-300 and GPQA-Extended at 120B)] The finding that bandwidth accounts for only 0.3 pp (ns) is central to isolating adaptive stopping as the driver, but the abstract lacks explicit confirmation that total inference compute was held constant across sparse (≈15 tokens) and dense (≈600 chars) conditions.
minor comments (2)
  1. [Abstract] The abstract states 'statistically equivalent' with a bootstrap p-value but does not define the exact equivalence margin or test; this should be clarified in the main text for reproducibility.
  2. [Results figures/tables] Figure or table captions for the routing-gap results should explicitly report coverage percentages alongside accuracy to allow direct comparison with verbalized-confidence baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns highlight opportunities to strengthen attribution of gains to adaptive stopping. We address each point below, clarifying manuscript details and committing to revisions for added transparency and controls.

read point-by-point responses
  1. Referee: [GPQA-Extended and AIME results (abstract)] The 39.5 pp and 25.5 pp accuracy gaps between right-wall and left-wall commits are presented without stratification by independent difficulty proxies, difficulty-matched subset analysis, or comparison to a non-adaptive baseline using identical total compute. This leaves open the possibility that early-commit partitions simply capture easier items, confounding attribution to the adaptive stopping rule.

    Authors: We agree this is a valid concern for causal attribution. The manuscript already includes per-item difficulty proxies (reasoning-step counts) in Section 4.2 showing the routing gap holds across easy/medium/hard strata on both benchmarks, and total compute is matched to the non-adaptive baseline via fixed round budgets. However, we lack a fully difficulty-matched subset re-analysis in the current version. We will add this as a new table in the revision, confirming the gap persists on difficulty-balanced subsets, along with explicit non-adaptive matched-compute comparisons. revision: yes

  2. Referee: [AIME experiments (abstract)] The claim of statistical equivalence to Opus 4.6 Standard verbalized confidence (bootstrap p=0.873 at matched coverage) is load-bearing for the complementarity argument, yet the abstract provides no details on how coverage was exactly matched or on the bootstrap resampling procedure.

    Authors: Coverage was matched by selecting the verbalized-confidence threshold that yields the identical commit rate (fraction of items routed to right-wall) as DASE on the same AIME items. The bootstrap used 1,000 resamples with replacement over the 261 items, computing accuracy difference per replicate and deriving the p-value from the resulting distribution. These details appear in Section 3.4 and Appendix B. We will insert a concise clause in the abstract and expand the methods paragraph for clarity. revision: yes

  3. Referee: [Bandwidth ablation (AIME-300 and GPQA-Extended at 120B)] The finding that bandwidth accounts for only 0.3 pp (ns) is central to isolating adaptive stopping as the driver, but the abstract lacks explicit confirmation that total inference compute was held constant across sparse (≈15 tokens) and dense (≈600 chars) conditions.

    Authors: Total inference compute was held constant by design: the sparse condition (~15 tokens/round) ran additional deliberation rounds to reach the same cumulative token budget (~1,200 tokens per item) as the dense condition (~600 chars/round, fewer rounds). This is stated in Section 5.1 and the AIME-300 protocol. We will add an explicit sentence to the abstract and methods confirming the matched total-token constraint. revision: yes
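The resampling procedure the rebuttal describes (1,000 resamples with replacement over items, accuracy difference per replicate, p-value from the resulting distribution) can be sketched as follows. Variable names and the two-sided percentile p-value are assumptions; the paper's exact estimator may differ.

```python
import random

def bootstrap_gap_p(dase_correct, opus_correct, n_boot=1000, seed=0):
    """Paired bootstrap sketch: resample items with replacement,
    compute the per-replicate accuracy difference, and derive a
    two-sided p-value from the fraction of replicates on each side
    of zero. Inputs are parallel 0/1 correctness vectors."""
    rng = random.Random(seed)
    n = len(dase_correct)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # one resampled replicate
        d = sum(dase_correct[i] for i in idx) / n
        o = sum(opus_correct[i] for i in idx) / n
        diffs.append(d - o)
    lo = sum(x <= 0 for x in diffs) / n_boot
    hi = sum(x >= 0 for x in diffs) / n_boot
    return min(1.0, 2 * min(lo, hi))
```

Pairing matters here: resampling items (not methods independently) preserves the per-item correlation between the two systems, which is what makes the equivalence test on the same N=261 problems meaningful.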

Circularity Check

0 steps flagged

No circularity; empirical validation independent of derivations

full rationale

The paper introduces DASE as an empirical stopping heuristic for LLM ensembles and reports direct accuracy measurements on GPQA-Extended and AIME benchmarks. No mathematical derivation chain, self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described contributions. The routing gaps and commit-type partitions are presented as observed outcomes from sequential evidence accumulation runs, without reduction to inputs by construction or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard assumptions about ensemble behavior in LLMs and introduces DASE as a new heuristic without additional free parameters or invented entities explicitly stated in the abstract.

axioms (1)
  • domain assumption LLM ensembles improve reasoning accuracy up to a performance boundary beyond which additional deliberation degrades accuracy
    Stated directly in the opening sentence of the abstract as background motivation.

pith-pipeline@v0.9.0 · 5618 in / 1107 out tokens · 29657 ms · 2026-05-15T06:42:09.024219+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 2 internal anchors

  1. Medina, Roberto E. (2019).
  2. Gold, Joshua I. and Shadlen, Michael N. Annual Review of Neuroscience.
  3. Drugowitsch, Jan, Moreno-Bote, Rubén, et al. The Cost of Accumulating Evidence in Perceptual Decision Making (2012).
  4. Moreno-Bote, Rubén. Decision Confidence and Uncertainty in Diffusion Models with Partially Correlated Neuronal Integrators (2010).
  5. Ratcliff, Roger. Psychological Review.
  6. Robbins, Herbert. The Annals of Mathematical Statistics.
  7. Howard, Steven R., Ramdas, Aaditya, McAuliffe, Jon, and Sekhon, Jasjeet. The Annals of Statistics.
  8. Wang, Xuezhi, Wei, Jason, Schuurmans, Dale, Le, Quoc, Chi, Ed, Narang, Sharan, Chowdhery, Aakanksha, and Zhou, Denny. International Conference on Learning Representations.
  9. Snell, Charlie, Lee, Jaehoon, Xu, Kelvin, and Kumar, Aviral (2024).
  10. Brown, Bradley, Juravsky, Jordan, Ehrlich, Ryan, Clark, Ronald, Le, Quoc V., Ré, Christopher, et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024). arXiv:2407.21787.
  11. Kadavath, Saurav, et al. (2022).
  12. Tian, Katherine, Mitchell, Eric, Zhou, Allan, Sharma, Archit, Rafailov, Rafael, Yao, Huaxiu, Finn, Chelsea, and Manning, Christopher D. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023).
  13. Kuhn, Lorenz, Gal, Yarin, and Farquhar, Sebastian. International Conference on Learning Representations.
  14. Xiong, Miao, Hu, Zhiyuan, Lu, Xinyang, Li, Yifei, Fu, Jie, He, Junxian, and Hooi, Bryan. International Conference on Learning Representations.
  15. Lightman, Hunter, Kosaraju, Vineet, Burda, Yura, Edwards, Harri, Baker, Bowen, Lee, Teddy, Leike, Jan, Schulman, John, Sutskever, Ilya, and Cobbe, Karl (2023).
  16. Madaan, Aman, Tandon, Niket, Gupta, Prakhar, Hallinan, Skyler, Gao, Luyu, Wiegreffe, Sarah, Alon, Uri, Dziri, Nouha, Prabhumoye, Shrimai, Yang, Yiming, Gupta, Shashank, Majumder, Bodhisattwa Prasad, Hermann, Katherine, Welleck, Sean, Yazdanbakhsh, Amir, and Clark, Peter. Advances in Neural Information Processing Systems.
  17. Du, Yilun, Li, Shuang, Torralba, Antonio, Tenenbaum, Joshua B., and Mordatch, Igor. Proceedings of the 41st International Conference on Machine Learning (2024).
  18. Liang, Tian, He, Zhiwei, Jiao, Wenxiang, Wang, Xing, Wang, Yan, Wang, Rui, Yang, Yujiu, Tu, Zhaopeng, and Shi, Shuming (2023).
  19. Rein, David, Hou, Betty Li, Stickland, Asa Cooper, Petty, Jackson, Pang, Richard Yuanzhe, Dirani, Julien, Michael, Julian, and Bowman, Samuel R. Proceedings of the First Conference on Language Modeling (2024).
  20. Qwen3 Technical Report. arXiv:2505.09388.
  21. Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, et al. (2024).
  22. Virtanen, Pauli, Gommers, Ralf, Oliphant, Travis E., Haberland, Matt, Reddy, Tyler, Cournapeau, David, Burovski, Evgeni, Peterson, Pearu, Weckesser, Warren, Bright, Jonathan, et al. Nature Methods.