Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs
Pith reviewed 2026-05-18 02:03 UTC · model grok-4.3
The pith
Self-consistency sampling now yields only tiny accuracy gains while token costs rise almost linearly with the number of paths tried.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-consistency was built for models that made frequent unpredictable errors, yet on today's stronger models the added reasoning paths produce only marginal accuracy lifts while token consumption scales nearly linearly. On HotpotQA the gain across twenty samples is 0.4 percent and on MATH-500 it is 1.6 percent. Accuracy plateaus early and can decline at higher sample counts because extra paths introduce noise instead of new signal once the model already solves the problem reliably in a single pass.
What carries the argument
Self-consistency sampling: several reasoning paths are sampled, and the most frequent final answer is selected.
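The mechanism named here can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sample_answer` is a hypothetical callable standing in for one sampled reasoning path that returns a final answer string, and `self_consistency` is a name introduced only for this example.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical: one sampled path -> final answer
    question: str,
    n_paths: int,
) -> Tuple[str, float]:
    """Sample n_paths answers and return the most frequent one with its vote share."""
    if n_paths < 1:
        raise ValueError("n_paths must be at least 1")
    # Each call is assumed to be an independent, stochastic reasoning path.
    answers: List[str] = [sample_answer(question) for _ in range(n_paths)]
    votes = Counter(answers)
    # Ties resolve in favour of the answer that was seen first.
    winner, count = votes.most_common(1)[0]
    return winner, count / n_paths
```

Every extra path is one more full model call, which is where the near-linear token cost discussed on this page comes from.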
If this is right
- Token usage grows almost linearly as the number of sampled reasoning paths increases.
- Accuracy improvements stop after only a few additional samples on the tested tasks.
- High sample counts can reduce accuracy in some configurations by injecting conflicting answers.
- Indiscriminate multi-path sampling becomes harder to justify once inference costs scale with model size.
- Multi-path methods should be applied selectively to problems that exceed a model's single-pass reliability.
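The cost-benefit trade-off in these points can be made concrete with back-of-the-envelope arithmetic. The sketch below rests on assumptions that are not confirmed by the source: the reported gains are treated as percentage points measured from one sample to twenty, and token cost is treated as exactly proportional to the sample count. The function names are illustrative.

```python
def gain_per_extra_sample(total_gain_points: float, n_samples: int) -> float:
    """Average accuracy gain, in percentage points, per sample beyond the first."""
    if n_samples < 2:
        raise ValueError("need at least two samples to speak of extra samples")
    return total_gain_points / (n_samples - 1)


def relative_token_cost(n_samples: int) -> float:
    """Token cost relative to a single pass, assuming exactly linear scaling."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return float(n_samples)


if __name__ == "__main__":
    # Gains as reported in the abstract; the interpretation is an assumption.
    for benchmark, gain in [("HotpotQA", 0.4), ("MATH-500", 1.6)]:
        per_sample = gain_per_extra_sample(gain, 20)
        cost = relative_token_cost(20)
        print(f"{benchmark}: {per_sample:.3f} points per extra sample at {cost:.0f}x tokens")
```

Under these assumptions the average return is roughly 0.02 (HotpotQA) and 0.08 (MATH-500) percentage points per additional sample, against about twenty times the single-pass token budget.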
Where Pith is reading between the lines
- Future work may shift focus from post-hoc sampling to methods that improve single-pass reliability directly.
- Similar cost-benefit patterns could appear in other multi-sample techniques such as beam search or ensembles.
- Practical systems could add early stopping rules that halt sampling once answer confidence is high.
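One way the early-stopping idea in the last point might look is sketched below. It is not taken from the paper: `sample_answer` is a hypothetical callable returning one sampled answer, and the `min_paths` and `threshold` defaults are arbitrary illustrative values. Vote share is used here as a simple stand-in for answer confidence.

```python
from collections import Counter
from typing import Callable, Tuple


def sample_until_confident(
    sample_answer: Callable[[str], str],  # hypothetical: one sampled path -> final answer
    question: str,
    max_paths: int = 20,
    min_paths: int = 3,
    threshold: float = 0.8,
) -> Tuple[str, int]:
    """Sample paths one at a time; stop once the leading answer's vote share
    reaches `threshold` (after at least `min_paths` samples) or the budget runs out.
    Returns the leading answer and the number of paths actually sampled."""
    if max_paths < 1:
        raise ValueError("max_paths must be at least 1")
    votes: Counter = Counter()
    leader = ""
    for drawn in range(1, max_paths + 1):
        votes[sample_answer(question)] += 1
        leader, count = votes.most_common(1)[0]
        if drawn >= min_paths and count / drawn >= threshold:
            return leader, drawn  # confident enough: skip the remaining paths
    return leader, max_paths
```

On questions the model answers consistently, such a rule would stop after `min_paths` samples, so most of the token budget would be spent only where the sampled answers disagree.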
Load-bearing premise
The results obtained with Gemini 2.5 models on HotpotQA and MATH-500 are representative of how self-consistency behaves across other strong current models and typical reasoning tasks.
What would settle it
A replication on several frontier models that records continued accuracy growth beyond ten samples on a wider range of benchmarks without later performance drops would contradict the reported plateau and noise effect.
Original abstract
Self-consistency -- sampling multiple reasoning paths and selecting the most frequent answer -- was designed for an era when language models made frequent, unpredictable errors. This study argues that the technique has become increasingly wasteful as models grow stronger, and may degrade performance on problems that modern models already solve reliably. Using Gemini 2.5 models on HotpotQA and MATH-500, we show that accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count. Critically, performance plateaued early and in some configurations declined at high sample counts, suggesting that additional paths introduce noise rather than signal when models already solve problems reliably. As inference costs rise with model scale, indiscriminate self-consistency is difficult to justify. We recommend reserving multi-path sampling for problems that demonstrably exceed a model's single-pass reliability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that self-consistency—sampling multiple reasoning paths and selecting the majority answer—has become inefficient for strong modern LLMs. Using Gemini 2.5 models on HotpotQA and MATH-500, it reports minimal accuracy improvements (0.4% and 1.6% across 20 samples) that plateau early or even decline at higher sample counts, while token costs increase nearly linearly, and recommends restricting multi-path sampling to problems where single-pass reliability is low.
Significance. If the empirical pattern holds, the result has clear practical value for inference-cost optimization: it supplies concrete evidence that a once-standard technique now yields diminishing returns on frontier models, supporting more selective deployment of sampling-based methods and highlighting the need to match reasoning strategies to per-model reliability.
major comments (2)
- [Abstract / Results] Abstract and experimental results: the reported accuracy deltas (0.4% on HotpotQA, 1.6% on MATH-500) are presented without error bars, confidence intervals, or statistical significance tests, and the protocol does not specify the number of independent runs or variance across seeds; this directly affects the reliability of the central claim that gains are minimal and that performance can decline.
- [Experiments] Experimental scope: all measurements are confined to Gemini 2.5 variants on only HotpotQA and MATH-500; no ablations or comparisons appear for other frontier models (e.g., GPT-4o, Claude 3.5) or additional reasoning benchmarks (e.g., GSM8K, GPQA), so the generalization that self-consistency is “losing its edge across modern LLMs” rests on an untested extrapolation that is load-bearing for the recommendation.
minor comments (1)
- [Results] The linear cost scaling is unsurprising but would benefit from an explicit plot or table showing tokens per sample count to make the cost-benefit trade-off visually immediate.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and indicate the revisions made.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and experimental results: the reported accuracy deltas (0.4% on HotpotQA, 1.6% on MATH-500) are presented without error bars, confidence intervals, or statistical significance tests, and the protocol does not specify the number of independent runs or variance across seeds; this directly affects the reliability of the central claim that gains are minimal and that performance can decline.
  Authors: We agree that the lack of error bars and statistical tests limits the strength of the presented claims. In the revised manuscript we have rerun the primary experiments across five independent random seeds, added standard-error bars to all accuracy plots and tables, and included a full description of the experimental protocol (including seed count and variance) in the Methods section. We also report the results of paired statistical tests confirming that the small observed gains are not significant at conventional thresholds, which directly bolsters the central argument of minimal returns.
  Revision: yes
- Referee: [Experiments] Experimental scope: all measurements are confined to Gemini 2.5 variants on only HotpotQA and MATH-500; no ablations or comparisons appear for other frontier models (e.g., GPT-4o, Claude 3.5) or additional reasoning benchmarks (e.g., GSM8K, GPQA), so the generalization that self-consistency is “losing its edge across modern LLMs” rests on an untested extrapolation that is load-bearing for the recommendation.
  Authors: We acknowledge the limited scope of the evaluation. Gemini 2.5 was selected as a representative frontier model and the two benchmarks were chosen to span multi-hop and mathematical reasoning; however, we do not claim the results are universal. In the revision we have added a dedicated Limitations section that explicitly discusses the narrow model and benchmark coverage, cautions against over-generalization, and recommends that practitioners first measure single-pass reliability on their target model and task. While resource constraints prevent exhaustive testing of every frontier model, the observed pattern of early plateauing and linear cost growth is presented as a case study rather than a definitive proof for all LLMs.
  Revision: partial
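The seed-level statistics promised in the first response could be computed along the following lines. This is a sketch under assumptions: the rebuttal does not say which paired test was used, so a paired t-statistic is shown as one common choice, and the function names are illustrative. Inputs would be per-seed accuracies for two configurations, for example one sample versus twenty, matched by seed.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def paired_t_statistic(baseline: Sequence[float], treatment: Sequence[float]) -> float:
    """Paired t-statistic for per-seed accuracies matched by seed.
    Assumption: a paired t-test is appropriate; the source does not name the test."""
    if len(baseline) != len(treatment):
        raise ValueError("baseline and treatment must have one value per seed")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean_diff, std_err = mean_and_standard_error(diffs)
    if std_err == 0:
        raise ValueError("differences have zero variance; the statistic is undefined")
    return mean_diff / std_err
```

With five seeds the statistic would be compared against a t distribution with four degrees of freedom, which sets a fairly high bar for a gain of a fraction of a percentage point.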
Circularity Check
No circularity: direct empirical measurements on fixed benchmarks
full rationale
The paper reports experimental accuracy and cost measurements obtained by running self-consistency sampling (varying path counts) on Gemini 2.5 models for HotpotQA and MATH-500. No derivation chain, fitted parameters, or self-citations are used to generate the central claims; the reported 0.4% and 1.6% gains, plateauing behavior, and linear cost scaling are direct observations from the runs. The study is self-contained against external benchmarks and contains no load-bearing steps that reduce to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: HotpotQA and MATH-500 are representative benchmarks for evaluating reasoning reliability in modern LLMs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
  CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
Reference graph
Works this paper leans on
- [2] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- [4] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- [5] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
- [6] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.