arxiv: 2511.00751 · v2 · submitted 2025-11-02 · 💻 cs.AI · cs.CL

Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs

Pith reviewed 2026-05-18 02:03 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: self-consistency, diminishing returns, inference costs, reasoning paths, large language models, majority voting, HotpotQA, MATH-500

The pith

Self-consistency sampling now yields only tiny accuracy gains while token costs rise almost linearly with the number of paths tried.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-consistency samples multiple reasoning paths from a language model and selects the answer that appears most often. This paper tests the method on Gemini 2.5 models using HotpotQA and MATH-500 and finds that accuracy improves by just 0.4 percent on HotpotQA and 1.6 percent on MATH-500 even after twenty samples. At the same time the number of tokens processed grows steadily with each added path. Performance stops rising after a few samples and sometimes falls when too many paths are included, because later paths tend to add conflicting answers rather than reinforce the correct one. The authors therefore argue that the technique should be used only on problems where a single pass from the model is unreliable.

Core claim

Self-consistency was built for models that made frequent unpredictable errors, yet on today's stronger models the added reasoning paths produce only marginal accuracy lifts while token consumption scales nearly linearly. On HotpotQA the gain across twenty samples is 0.4 percent and on MATH-500 it is 1.6 percent. Accuracy plateaus early and can decline at higher sample counts because extra paths introduce noise instead of new signal once the model already solves the problem reliably in a single pass.

What carries the argument

self-consistency sampling of multiple reasoning paths followed by frequency-based answer selection
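
The mechanism named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `sample_answer` is a hypothetical callable standing in for one sampled model call that returns a final answer string, and answer normalization (which real pipelines need before counting) is reduced here to trimming whitespace and lowercasing.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical: one sampled model call
    question: str,
    n_paths: int,
) -> Tuple[str, List[str]]:
    """Sample n_paths answers and return the most frequent one plus all samples."""
    if n_paths < 1:
        raise ValueError("n_paths must be at least 1")
    # Crude normalization so that "42" and " 42 " count as the same answer.
    answers = [sample_answer(question).strip().lower() for _ in range(n_paths)]
    counts = Counter(answers)
    # most_common keeps first-seen order among equal counts, so a tie is
    # resolved in favor of the answer that was sampled first.
    winner, _ = counts.most_common(1)[0]
    return winner, answers


if __name__ == "__main__":
    canned = iter(["42", "41", "42", "7", "42"])  # stand-in for model outputs
    winner, answers = self_consistency(lambda q: next(canned), "toy question", 5)
    print(winner, answers)
```

With `n_paths=1` the function reduces to a single pass, which is the baseline the paper's gains are measured against.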

If this is right

  • Token usage grows almost linearly as the number of sampled reasoning paths increases.
  • Accuracy improvements stop after only a few additional samples on the tested tasks.
  • High sample counts can reduce accuracy in some configurations by injecting conflicting answers.
  • Indiscriminate multi-path sampling becomes harder to justify once inference costs scale with model size.
  • Multi-path methods should be applied selectively to problems that exceed a model's single-pass reliability.
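
The trade-off in the bullets above reduces to simple arithmetic. The sketch below assumes exactly linear token scaling, reads the reported 0.4 and 1.6 percent as absolute percentage-point gains of twenty paths over a single pass, and uses hypothetical helper names. None of these assumptions is confirmed by the summary, so treat the output as an order-of-magnitude illustration.

```python
def relative_token_cost(n_paths: int) -> float:
    """Token cost of n_paths relative to one pass, ASSUMING linear scaling."""
    if n_paths < 1:
        raise ValueError("n_paths must be at least 1")
    return float(n_paths)


def extra_passes_per_point(n_paths: int, gain_points: float) -> float:
    """Extra single-pass token budgets spent per percentage point gained."""
    if gain_points <= 0:
        raise ValueError("gain_points must be positive")
    return (relative_token_cost(n_paths) - 1.0) / gain_points


# Gains as reported in the abstract, read here as absolute percentage points.
REPORTED_GAINS = {"HotpotQA": 0.4, "MATH-500": 1.6}

if __name__ == "__main__":
    for benchmark, gain in REPORTED_GAINS.items():
        ratio = extra_passes_per_point(20, gain)
        print(f"{benchmark}: {ratio:.3f} extra single-pass budgets per point")
```

Under these assumptions, twenty paths spend 19 extra single-pass budgets, which works out to 47.5 budgets per point on HotpotQA and about 11.9 on MATH-500.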

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work may shift focus from post-hoc sampling to methods that improve single-pass reliability directly.
  • Similar cost-benefit patterns could appear in other multi-sample techniques such as beam search or ensembles.
  • Practical systems could add early stopping rules that halt sampling once answer confidence is high.
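
One conservative version of such an early stopping rule is sketched below. It is an illustrative heuristic, not the method of any cited paper: sampling halts as soon as the leading answer can no longer be overtaken by the samples that remain, so the result always matches the full-budget majority vote. `sample_answer` is again a hypothetical stand-in for one sampled model call.

```python
from collections import Counter
from typing import Callable, Tuple


def vote_with_early_stop(
    sample_answer: Callable[[str], str],  # hypothetical: one sampled model call
    question: str,
    max_paths: int,
) -> Tuple[str, int]:
    """Majority vote that stops once the leader is mathematically safe.

    Returns the winning answer and the number of paths actually sampled.
    """
    if max_paths < 1:
        raise ValueError("max_paths must be at least 1")
    counts: Counter = Counter()
    for drawn in range(1, max_paths + 1):
        counts[sample_answer(question)] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = max_paths - drawn
        # Even if every remaining sample went to the runner-up (or to a new
        # answer), it could not catch the leader, so further calls are wasted.
        if lead - runner_up > remaining:
            return ranked[0][0], drawn
    return counts.most_common(1)[0][0], max_paths
```

When a model answers unanimously, this rule stops after 11 of 20 paths. Confidence-threshold rules can stop sooner, at the price of occasionally disagreeing with the full vote.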

Load-bearing premise

The results obtained with Gemini 2.5 models on HotpotQA and MATH-500 are representative of how self-consistency behaves across other strong current models and typical reasoning tasks.

What would settle it

A replication on several frontier models that records continued accuracy growth beyond ten samples on a wider range of benchmarks without later performance drops would contradict the reported plateau and noise effect.

Figures

Figures reproduced from arXiv: 2511.00751 by Chiyan Loo.

Figure 1: Gemini-2.5-Flash-Lite accuracy and cost on HotpotQA across varying numbers of agents.
Figure 2: Gemini-2.5-Flash-Lite accuracy and cost on Math-500 across varying numbers of agents.
Figure 3: Gemini-2.5-Pro accuracy and cost on Math-500 across up to 15 agents. Accuracy im…
Original abstract

Self-consistency -- sampling multiple reasoning paths and selecting the most frequent answer -- was designed for an era when language models made frequent, unpredictable errors. This study argues that the technique has become increasingly wasteful as models grow stronger, and may degrade performance on problems that modern models already solve reliably. Using Gemini 2.5 models on HotpotQA and MATH-500, we show that accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count. Critically, performance plateaued early and in some configurations declined at high sample counts, suggesting that additional paths introduce noise rather than signal when models already solve problems reliably. As inference costs rise with model scale, indiscriminate self-consistency is difficult to justify. We recommend reserving multi-path sampling for problems that demonstrably exceed a model's single-pass reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that self-consistency—sampling multiple reasoning paths and selecting the majority answer—has become inefficient for strong modern LLMs. Using Gemini 2.5 models on HotpotQA and MATH-500, it reports minimal accuracy improvements (0.4% and 1.6% across 20 samples) that plateau early or even decline at higher sample counts, while token costs increase nearly linearly, and recommends restricting multi-path sampling to problems where single-pass reliability is low.

Significance. If the empirical pattern holds, the result has clear practical value for inference-cost optimization: it supplies concrete evidence that a once-standard technique now yields diminishing returns on frontier models, supporting more selective deployment of sampling-based methods and highlighting the need to match reasoning strategies to per-model reliability.

major comments (2)
  1. [Abstract / Results] Abstract and experimental results: the reported accuracy deltas (0.4% on HotpotQA, 1.6% on MATH-500) are presented without error bars, confidence intervals, or statistical significance tests, and the protocol does not specify the number of independent runs or variance across seeds; this directly affects the reliability of the central claim that gains are minimal and that performance can decline.
  2. [Experiments] Experimental scope: all measurements are confined to Gemini 2.5 variants on only HotpotQA and MATH-500; no ablations or comparisons appear for other frontier models (e.g., GPT-4o, Claude 3.5) or additional reasoning benchmarks (e.g., GSM8K, GPQA), so the generalization that self-consistency is “losing its edge across modern LLMs” rests on an untested extrapolation that is load-bearing for the recommendation.
minor comments (1)
  1. [Results] The linear cost scaling is unsurprising but would benefit from an explicit plot or table showing tokens per sample count to make the cost-benefit trade-off visually immediate.
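
Major comment 1 asks for significance tests on the accuracy deltas. Because the single-pass and multi-path runs answer the same questions, one reasonable option is an exact McNemar test on the questions where the two settings disagree. The sketch below is an assumption about how such a check could look, not the authors' protocol, and the discordant counts in the demo are hypothetical. If the 1.6 percent gain on MATH-500 is an absolute figure over 500 problems, it corresponds to a net change of only 8 problems.

```python
from math import comb


def mcnemar_exact_p(only_baseline_correct: int, only_voting_correct: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    only_baseline_correct: questions the single pass got right and voting got wrong.
    only_voting_correct:   questions voting got right and the single pass got wrong.
    Questions where both settings agree carry no information and are ignored.
    """
    if only_baseline_correct < 0 or only_voting_correct < 0:
        raise ValueError("counts must be non-negative")
    n = only_baseline_correct + only_voting_correct
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(only_baseline_correct, only_voting_correct)
    # Under the null hypothesis, each discordant question is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # HYPOTHETICAL split: 10 questions lost and 18 gained gives a net gain of 8.
    print(mcnemar_exact_p(10, 18))
```

The same net gain can be significant or not depending on how many questions flip in each direction, which is why per-question outcomes, not just aggregate accuracy, need to be reported.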

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and indicate the revisions made.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and experimental results: the reported accuracy deltas (0.4% on HotpotQA, 1.6% on MATH-500) are presented without error bars, confidence intervals, or statistical significance tests, and the protocol does not specify the number of independent runs or variance across seeds; this directly affects the reliability of the central claim that gains are minimal and that performance can decline.

    Authors: We agree that the lack of error bars and statistical tests limits the strength of the presented claims. In the revised manuscript we have rerun the primary experiments across five independent random seeds, added standard-error bars to all accuracy plots and tables, and included a full description of the experimental protocol (including seed count and variance) in the Methods section. We also report the results of paired statistical tests confirming that the small observed gains are not significant at conventional thresholds, which directly bolsters the central argument of minimal returns. revision: yes

  2. Referee: [Experiments] Experimental scope: all measurements are confined to Gemini 2.5 variants on only HotpotQA and MATH-500; no ablations or comparisons appear for other frontier models (e.g., GPT-4o, Claude 3.5) or additional reasoning benchmarks (e.g., GSM8K, GPQA), so the generalization that self-consistency is “losing its edge across modern LLMs” rests on an untested extrapolation that is load-bearing for the recommendation.

    Authors: We acknowledge the limited scope of the evaluation. Gemini 2.5 was selected as a representative frontier model and the two benchmarks were chosen to span multi-hop and mathematical reasoning; however, we do not claim the results are universal. In the revision we have added a dedicated Limitations section that explicitly discusses the narrow model and benchmark coverage, cautions against over-generalization, and recommends that practitioners first measure single-pass reliability on their target model and task. While resource constraints prevent exhaustive testing of every frontier model, the observed pattern of early plateauing and linear cost growth is presented as a case study rather than a definitive proof for all LLMs. revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on fixed benchmarks

Full rationale

The paper reports experimental accuracy and cost measurements obtained by running self-consistency sampling (varying path counts) on Gemini 2.5 models for HotpotQA and MATH-500. No derivation chain, fitted parameters, or self-citations are used to generate the central claims; the reported 0.4% and 1.6% gains, plateauing behavior, and linear cost scaling are direct observations from the runs. The study is self-contained against external benchmarks and contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical measurements from two specific benchmarks and one model family rather than new theoretical constructs.

axioms (1)
  • Domain assumption: HotpotQA and MATH-500 are representative benchmarks for evaluating reasoning reliability in modern LLMs.
    The generalization from these two datasets to broader claims about self-consistency depends on this assumption.

pith-pipeline@v0.9.0 · 5694 in / 1361 out tokens · 56059 ms · 2026-05-18T02:03:31.798396+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

    cs.LG 2026-05 unverdicted novelty 7.0

    CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Aggarwal, P., Madaan, A., Yang, Y., et al. (2023). Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. arXiv preprint arXiv:2305.11860

  2. [2]

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874

  3. [3]

    Wan, G., Wu, Y., Chen, J., & Li, S. (2024). Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling. arXiv preprint arXiv:2408.17017

  4. [4]

    Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

  5. [5]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837

  6. [6]

    Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600