Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning

Junseok Kim; Kyomin Jung; Kyungmin Min; Nakyeong Yang

arxiv: 2601.02970 · v2 · submitted 2026-01-06 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-16 17:43 UTC · model grok-4.3

keywords: self-consistency, adaptive sampling, LLM reasoning, confidence estimation, inference efficiency, early stopping, reliability-aware aggregation

The pith

ReASC shifts LLM self-consistency from counting matching answers to checking when accumulated confidence provides enough evidence, cutting sampling costs while holding accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard self-consistency draws many samples and takes the majority vote, which often wastes compute on questions that need fewer draws. ReASC instead uses the model's own per-response confidence to decide whether one answer suffices or more are required, then weights the final answer by both how often each option appears and how confident the model was in each. The method splits into an initial single-sample check followed by a reliability-aware accumulation step that stops once the evidence strength meets an internal threshold. Across five models and four datasets the approach matches full self-consistency accuracy at substantially lower average sample counts. Readers should care because inference cost is the main barrier to deploying reliable multi-step reasoning at scale.

Core claim

ReASC reframes adaptive sampling from response counting to evidence sufficiency by leveraging response-level confidence for principled information aggregation. It operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets ReASC consistently achieves the best accuracy-cost trade-off, reducing inference cost by up to 70 percent relative to self-consistency while preserving accuracy.
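The two-stage procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it borrows the gate `S(y) ≥ τ_gate`, the update `v(y) ← v(y) + max(1, exp(λ z(y)))`, and the stopping quantity `1 − I_{1/2}(α, β)` from the passages quoted in the Lean-theorems section of this page. Everything else is an assumption made for the sketch: the `sample` callback and its return shape, the `stop_prob` and `max_samples` parameters, the uniform Beta(1, 1) prior that maps weights to `α` and `β`, and the use of Simpson's rule in place of a library incomplete-beta routine.

```python
import math
from collections import defaultdict
from typing import Callable, Tuple


def prob_top_beats_runner_up(alpha: float, beta: float, steps: int = 2000) -> float:
    """Return 1 - I_{1/2}(alpha, beta), integrating the Beta pdf with Simpson's rule.

    Assumes alpha >= 1 and beta >= 1 so the pdf is bounded on [0, 1/2];
    `steps` must be even.
    """
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)

    def pdf(x: float) -> float:
        if x == 0.0:
            # x**(alpha - 1) is 1 when alpha == 1 and 0 when alpha > 1.
            return math.exp(log_norm) if alpha == 1.0 else 0.0
        return math.exp(
            log_norm + (alpha - 1.0) * math.log(x) + (beta - 1.0) * math.log1p(-x)
        )

    h = 0.5 / steps
    total = pdf(0.0) + pdf(0.5)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * pdf(i * h)
    return 1.0 - total * h / 3.0


def reasc_sketch(
    sample: Callable[[], Tuple[str, float, float]],  # hypothetical: (answer, S, z)
    tau_gate: float = 0.9,
    lam: float = 1.0,
    stop_prob: float = 0.95,   # assumed stopping threshold
    max_samples: int = 16,     # assumed sampling budget
) -> Tuple[str, int]:
    """Return (chosen answer, number of samples drawn)."""
    answer, score, z = sample()
    # Stage 1: accept a single response whose confidence clears the gate.
    if score >= tau_gate:
        return answer, 1

    # Stage 2: accumulate confidence-weighted evidence per distinct answer.
    votes = defaultdict(float)
    n_used = 1
    while True:
        votes[answer] += max(1.0, math.exp(lam * z))
        ranked = sorted(votes.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        # Assumed Beta(1, 1) prior over the top answer's share of the top-two mass.
        p_lead = prob_top_beats_runner_up(1.0 + ranked[0], 1.0 + runner_up)
        if p_lead >= stop_prob or n_used >= max_samples:
            return max(votes, key=votes.get), n_used
        answer, _, z = sample()
        n_used += 1
```

Because each weight is at least 1, a low-confidence response still counts as one ordinary vote, while a high-confidence response can count for more; that is what lets the loop stop earlier than a pure count-based rule when the model is confident.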

What carries the argument

Reliability-aware accumulation stage that jointly weights responses by frequency and per-response confidence scores instead of treating every sample as equal.

If this is right

High-confidence single responses allow immediate acceptance and avoid extra sampling on easy instances
Joint frequency-confidence weighting reduces the impact of occasional low-quality but frequent answers
The same two-stage logic scales without modification from 3B to 27B parameter models
No additional training or fine-tuning is required, only access to token-level or response-level scores

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better-calibrated models would amplify ReASC's efficiency gains without any change to the algorithm
The same evidence-sufficiency framing could replace majority voting inside tree-of-thought or other multi-path search methods
Real-time applications could see lower tail latency because average sample count drops on the majority of queries

Load-bearing premise

The LLM's per-response confidence scores must be sufficiently calibrated and correlated with actual correctness rather than being overconfident or random.

What would settle it

An experiment on a dataset where the model's per-response scores show zero or negative correlation with correctness would eliminate any accuracy or cost advantage for ReASC over count-based adaptive baselines.
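One way to run the first half of that experiment is to measure how strongly per-response confidence tracks correctness before comparing methods. The sketch below computes a plain Pearson correlation between confidence scores and 0/1 correctness labels (the point-biserial correlation). The function name and input layout are illustrative assumptions; the paper does not specify this diagnostic.

```python
import math
from typing import Sequence


def confidence_correctness_correlation(
    conf: Sequence[float], correct: Sequence[bool]
) -> float:
    """Pearson correlation between confidence scores and 0/1 correctness.

    Returns NaN when either input has zero variance, since the
    correlation is undefined in that case.
    """
    n = len(conf)
    if n != len(correct) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 items")
    labels = [1.0 if ok else 0.0 for ok in correct]
    mean_c = sum(conf) / n
    mean_y = sum(labels) / n
    cov = sum((c - mean_c) * (y - mean_y) for c, y in zip(conf, labels))
    var_c = sum((c - mean_c) ** 2 for c in conf)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_c == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_c * var_y)
```

A value near zero or below on a given dataset would mark it as a candidate for the settling experiment described above.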

Original abstract

Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReASC adds a two-stage confidence-weighted accumulation to adaptive self-consistency and reports clear efficiency gains, but the calibration of the confidence scores remains unexamined in the abstract.

ReASC takes the standard adaptive self-consistency setup and replaces pure count-based stopping with a two-stage process: a quick single-sample check for high-confidence cases, followed by accumulation that weights answers by both their frequency and the model's per-response confidence. This is a direct but useful shift from earlier count-only adaptive methods, and the paper shows it across five models (3B to 27B) and four datasets with consistent accuracy-cost improvements. The headline number is a claimed 70% cost drop on GSM8K with Gemma-3-4B-it at unchanged accuracy, which would matter for anyone running reasoning workloads at scale. The experiments are broad enough to give the efficiency claim some grounding, and the framing around evidence sufficiency rather than raw counts is a clean conceptual step.

The soft spot is the confidence signal. The method stands or falls on whether those scores are calibrated and more informative than counts alone; if the model is overconfident on wrong answers, early stopping and weighted aggregation could amplify errors. The abstract supplies no ECE numbers, Brier scores, or ablation that isolates the contribution of the confidence signal, so it is still unclear how much of the reported gain comes from the new component versus the adaptive skeleton itself.

This paper is for groups already using self-consistency on math and logic tasks who want to trim sampling budgets. Readers who know the prior adaptive papers will see the incremental move immediately. I would bring it to a reading group to check the exact aggregation formula and any calibration plots. It deserves peer review because the empirical scope is reasonable and the efficiency numbers are large enough to test further.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Reliability-Aware Adaptive Self-Consistency (ReASC), which augments standard self-consistency by using per-response LLM confidence scores in a two-stage process: a single-sample stage that resolves high-confidence instances early and a reliability-aware accumulation stage that weights responses by both frequency and confidence. Across five models (3B–27B) and four datasets, ReASC is claimed to deliver the best accuracy-cost trade-off, including up to 70% inference-cost reduction versus vanilla self-consistency on GSM8K with Gemma-3-4B-it while preserving accuracy.

Significance. If the central empirical claim holds after verification of confidence calibration, the method offers a practical route to lower inference budgets for multi-sample reasoning without sacrificing reliability. The two-stage design and joint frequency-confidence weighting are straightforward extensions of existing adaptive self-consistency work and could be adopted in production pipelines if the calibration assumption is shown to be robust.

major comments (3)

[§3.2 and §4.2] §3.2 (single-sample decision stage) and §4.2 (results): the paper reports accuracy-cost improvements but supplies no calibration diagnostics (ECE, Brier score, or reliability diagrams) for the LLM-generated confidence scores used for early stopping. Without these metrics the 70% cost-reduction claim on GSM8K cannot be distinguished from over-confident early termination on incorrect answers.
[§4.3] §4.3 (ablation studies): no ablation isolates the contribution of the confidence-weighted aggregation from the adaptive sampling skeleton itself. The reported gains could be driven primarily by the early-stopping rule rather than the joint frequency-confidence weighting, weakening the claim that the reliability-aware component is essential.
[§4.1] §4.1 (experimental setup): the manuscript does not report statistical significance tests or error bars on the accuracy-cost curves across the five models and four datasets. The “consistently best” claim therefore rests on point estimates whose variability is unknown.
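For reference, the two scalar calibration diagnostics named in the first major comment can be computed as in the sketch below. It assumes confidence scores already lie in [0, 1] and that correctness is a boolean per response; the equal-width binning with `n_bins=10` is a common convention, not something the manuscript specifies.

```python
from typing import List, Sequence, Tuple


def brier_score(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - (1.0 if ok else 0.0)) ** 2 for c, ok in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted |mean confidence - accuracy| per bin."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, 1.0 if ok else 0.0))
    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

Both values are 0 for a perfectly calibrated, perfectly sharp scorer and grow as confidence and correctness diverge; the per-bin `(mean_conf, accuracy)` pairs are also the points a reliability diagram would plot.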

minor comments (2)

[§3.3] Notation for the joint weighting function in §3.3 is introduced without an explicit equation number; adding an equation label would improve traceability.
[§3.2] The threshold-selection procedure for the single-sample stage is described only at a high level; a short pseudocode block would clarify reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify gaps that can be addressed without misrepresenting our results, we commit to revisions in the next version of the manuscript.

Referee: [§3.2 and §4.2] §3.2 (single-sample decision stage) and §4.2 (results): the paper reports accuracy-cost improvements but supplies no calibration diagnostics (ECE, Brier score, or reliability diagrams) for the LLM-generated confidence scores used for early stopping. Without these metrics the 70% cost-reduction claim on GSM8K cannot be distinguished from over-confident early termination on incorrect answers.

Authors: We agree that explicit calibration diagnostics would strengthen the interpretation of the single-sample early-stopping stage. Although the fact that accuracy is preserved (rather than degraded) across five models and four datasets provides indirect evidence against systematic over-confident termination on errors, we accept that this is insufficient. In the revised manuscript we will add Expected Calibration Error (ECE), Brier scores, and reliability diagrams computed on the confidence scores used for the single-sample decision stage, reported per dataset and model scale. revision: yes
Referee: [§4.3] §4.3 (ablation studies): no ablation isolates the contribution of the confidence-weighted aggregation from the adaptive sampling skeleton itself. The reported gains could be driven primarily by the early-stopping rule rather than the joint frequency-confidence weighting, weakening the claim that the reliability-aware component is essential.

Authors: The referee correctly notes that our existing ablations compare ReASC against external baselines but do not isolate the effect of the joint frequency-confidence weighting from the adaptive sampling rule. We will add a new controlled ablation in the revised version: an adaptive-sampling-only variant that uses the same two-stage skeleton and early-stopping threshold but performs aggregation by frequency alone (no confidence weighting). Direct comparison of this variant against full ReASC will quantify the incremental contribution of the reliability-aware weighting. revision: yes
Referee: [§4.1] §4.1 (experimental setup): the manuscript does not report statistical significance tests or error bars on the accuracy-cost curves across the five models and four datasets. The “consistently best” claim therefore rests on point estimates whose variability is unknown.

Authors: We acknowledge that single-run point estimates limit the strength of the comparative claims. Because of the substantial compute required to repeat every configuration, we will add error bars (standard deviation across three random seeds) and paired statistical significance tests for the primary accuracy-cost results on GSM8K and the largest model scale in the revision. For the remaining datasets and model sizes we will note that results are single-run but consistent in direction. revision: partial
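The promised error bars and paired test could look like the sketch below. The rebuttal does not name a specific paired test, so the paired bootstrap here is a stand-in chosen for illustration; the function names, the per-question boolean inputs, and the resample count are likewise assumptions.

```python
import random
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, e.g. of accuracy over three seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_bootstrap_ci(
    correct_a: Sequence[bool],
    correct_b: Sequence[bool],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for accuracy(A) - accuracy(B) on the same questions.

    Questions are resampled with replacement and the pairing is kept, so
    per-question difficulty cancels out of the comparison.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [float(a) - float(b) for a, b in zip(correct_a, correct_b)]
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = max(lo_idx, int((1.0 + level) / 2.0 * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]
```

An interval that excludes zero suggests the accuracy gap between two methods is unlikely to be resampling noise; an interval that straddles zero is consistent with the "preserving accuracy" reading.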

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes ReASC as a two-stage extension of standard self-consistency that incorporates an external response-level confidence signal for early stopping and weighted aggregation. No equations, derivations, or self-citations are shown that reduce the claimed accuracy-cost improvements (such as the 70% cost reduction on GSM8K) to fitted parameters, self-referential quantities, or prior author results by construction. The central claims rest on empirical evaluation across five models and four datasets rather than any internal reduction of the method to its inputs, leaving the approach self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that model confidence scores correlate with answer correctness sufficiently to guide sampling decisions; no free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: LLM per-response confidence scores are sufficiently calibrated to serve as evidence for early stopping and weighted aggregation.
Invoked to justify the single-sample decision stage and the reliability-aware accumulation stage.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

ReASC … aggregates responses by jointly leveraging their frequency and confidence … confidence-weighted Beta update … $v(y_i) \leftarrow v(y_i) + \max(1, \exp(\lambda z(y_i)))$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.induction · unclear

Stage 1 … $S(y) \ge \tau_{\text{gate}}$ … Stage 2 … $P(p_1 > p_2 \mid V) = 1 - I_{1/2}(\alpha, \beta)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.