Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Aleksandr Rubashevskii; Artem Shelmanov; Ekaterina Fadeeva; Maiya Goloburda; Maxim Panov; Mrinmaya Sachan; Preslav Nakov; Roman Vashurin

arXiv: 2512.09538 · v2 · submitted 2025-12-10 · stat.ML · cs.CL · cs.LG

Pith reviewed 2026-05-16 23:47 UTC · model grok-4.3

classification: stat.ML · cs.CL · cs.LG

keywords: uncertainty quantification · large language models · beam search · consistency-based methods · question answering · multinomial sampling · probability bounds

The pith

Beam search generates better candidate sets than multinomial sampling for consistency-based uncertainty quantification in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Consistency-based uncertainty methods for large language models measure agreement across multiple generated answers. Multinomial sampling often yields duplicate answers in peaked distributions typical of short-form QA and produces high variance across runs. The paper replaces sampling with beam search to produce the candidate set and shows this yields higher accuracy plus lower variance on six QA datasets. It also derives a theoretical lower bound on the total probability mass of the beam set; when the bound is met, beam search provably reduces error relative to sampling. The result is a new family of consistency-based UQ methods that reach state-of-the-art performance.
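
To make the duplicate problem concrete, here is a minimal simulation sketch. The two answer distributions and the function name `redundant_fraction` are hypothetical illustrations, not values or code from the paper; the sketch only counts how many of M independent draws repeat an answer already drawn in the same batch.

```python
import random


def redundant_fraction(probs, m, trials=2000, seed=0):
    """Average share of the m draws that repeat an answer already drawn in the same batch."""
    rng = random.Random(seed)
    answers = list(range(len(probs)))
    total = 0.0
    for _ in range(trials):
        draws = rng.choices(answers, weights=probs, k=m)
        # Every draw beyond the number of distinct answers is a duplicate.
        total += (m - len(set(draws))) / m
    return total / trials


# Hypothetical answer distributions (assumptions for illustration only).
peaked = [0.85, 0.05, 0.04, 0.03, 0.02, 0.01]  # short-form QA style: one dominant answer
flat = [1 / 50] * 50                           # many equally likely answers

print(f"peaked: {redundant_fraction(peaked, m=10):.2f}")
print(f"flat:   {redundant_fraction(flat, m=10):.2f}")
```

Under these toy assumptions the peaked distribution wastes a much larger share of its sampling budget on repeats, which is the effect the paper attributes to short-form QA.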

Core claim

Beam search produces answer candidates whose agreement yields lower-error and lower-variance uncertainty estimates than multinomial sampling whenever the beam set captures a sufficient fraction of the total probability mass; a closed-form lower bound on that mass is given under which the error advantage is guaranteed.
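
The figure captions and the Theorem 1 excerpt on this page quote the condition as m_B > 1 − 1/(2√M), where M is the number of candidates and m_B is the total probability mass of the beam set. The short sketch below only evaluates that expression for a few values of M; the function name `beam_mass_threshold` is a hypothetical helper, not something from the paper.

```python
import math


def beam_mass_threshold(m: int) -> float:
    """Threshold 1 - 1/(2*sqrt(M)) that the beam-set mass must exceed (as quoted for Theorem 1)."""
    if m < 1:
        raise ValueError("m must be a positive number of candidates")
    return 1.0 - 1.0 / (2.0 * math.sqrt(m))


for m in (1, 5, 10, 20):
    print(m, round(beam_mass_threshold(m), 3))
```

The threshold rises with M, so a larger candidate budget asks the beam set to cover more of the distribution before the guarantee applies.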

What carries the argument

Beam search used to generate a fixed-size set of high-probability answer candidates for measuring consistency, together with the derived probability-mass lower bound that determines when it outperforms multinomial sampling.
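
As a rough sketch of the difference between the two estimators, the code below compares an unweighted average over sampled answers with a probability-weighted average over beam candidates. This page does not reproduce the paper's weighting formula, so the renormalised weights here are an assumption; the exact-match similarity, the function names, and the toy answers and probabilities are likewise hypothetical.

```python
def exact_match(a: str, b: str) -> float:
    """Toy similarity (assumption): 1.0 if the normalised strings match, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def monte_carlo_dissimilarity(reference, samples, similarity=exact_match):
    """Unweighted mean of (1 - similarity) between the reference answer and each sample."""
    return sum(1.0 - similarity(reference, s) for s in samples) / len(samples)


def beam_weighted_dissimilarity(reference, beams, probs, similarity=exact_match):
    """Mean of (1 - similarity) over beams, weighted by renormalised sequence probabilities.

    Renormalising by the beam-set mass is an assumption of this sketch.
    """
    total = sum(probs)
    return sum((p / total) * (1.0 - similarity(reference, b))
               for b, p in zip(beams, probs))


# Hypothetical data: reference answer, three sampled answers, three beam candidates.
reference = "Paris"
samples = ["Paris", "Paris", "Lyon"]
beams = ["Paris", "Lyon", "Marseille"]
probs = [0.80, 0.10, 0.05]

mc = monte_carlo_dissimilarity(reference, samples)
bw = beam_weighted_dissimilarity(reference, beams, probs)
print(f"Monte Carlo: {mc:.3f}  beam-weighted: {bw:.3f}")
```

The sampled estimate changes whenever the draws change, while the beam-weighted value is fixed once the beams and their probabilities are fixed.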

If this is right

Uncertainty estimates become more stable across independent runs without increasing the number of generations.
State-of-the-art UQ performance is reached on six standard short-form QA benchmarks.
The theoretical bound supplies a practical diagnostic for when beam search is guaranteed to improve over sampling.
Consistency-based methods can be applied with less sensitivity to the choice of decoding temperature or top-p.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same beam-search candidate generation could be paired with other agreement-based or entropy-based UQ techniques beyond the consistency methods tested here.
If the probability-mass bound is routinely satisfied, beam search might allow fewer total generations while preserving or improving UQ quality.
The approach may transfer to longer-form generation tasks where duplicate answers are less common but variance remains an issue.

Load-bearing premise

Beam search will reliably produce a candidate set whose total probability mass exceeds the derived lower bound in the peaked short-form QA regime.

What would settle it

Empirical measurement on a new QA dataset showing that typical beam sets fall below the probability-mass bound or that the claimed reduction in variance across repeated runs disappears.
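
One way such a measurement could be set up is sketched below, assuming per-beam sequence log-probabilities are available for each query. The threshold is the Theorem 1 condition as quoted on this page, m_B > 1 − 1/(2√M); the function names and the toy log-probabilities are hypothetical.

```python
import math


def beam_mass(logprobs):
    """Total probability mass of a beam set, from per-sequence log-probabilities."""
    return sum(math.exp(lp) for lp in logprobs)


def fraction_meeting_condition(per_query_logprobs, m):
    """Share of queries whose beam-set mass exceeds 1 - 1/(2*sqrt(m))."""
    threshold = 1.0 - 1.0 / (2.0 * math.sqrt(m))
    hits = sum(1 for lps in per_query_logprobs if beam_mass(lps) > threshold)
    return hits / len(per_query_logprobs)


# Hypothetical log-probabilities for two queries with M = 4 beams each.
queries = [
    [math.log(p) for p in (0.70, 0.15, 0.05, 0.02)],  # mass about 0.92
    [math.log(p) for p in (0.30, 0.20, 0.10, 0.05)],  # mass about 0.65
]

print(fraction_meeting_condition(queries, m=4))
```

Running this over a full dataset, split by output length, would give the same kind of statistic that the Figure 3 caption describes.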

Figures

Figures reproduced from arXiv: 2512.09538 by Aleksandr Rubashevskii, Artem Shelmanov, Ekaterina Fadeeva, Maiya Goloburda, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Roman Vashurin.

**Figure 1.** Beam Search vs Multinomial Sampling. Sampling produces multiple identical generations resulting in noisy confidence estimate, while beam search covers top answers from LLM distribution resulting in a better confidence score.

**Figure 2.** Mean percentage of redundant samples (i.e., outputs already seen among earlier generations) as a function of greedy output length. Results were obtained from 2,000 questions from the TriviaQA dataset using the Gemma 3 4B base model and 10 candidate generations. Redundancy is especially high for short answers, leading to wasted computation.

**Figure 3.** Percentage of texts meeting the sufficient condition (Theorem 1). Results are based on 2,000 TriviaQA questions, Gemma 3 4B base and M = 10. The green “All” bar shows the overall percentage across all lengths. From Theorem 1, the beam-weighted estimator is more accurate than the Monte Carlo estimator whenever the total beam probability mass m_B exceeds 1 − 1/(2√M). For M = 10, the threshold is m_B > 0.842.

**Figure 4.** PRR (↑ is better) as a function of the number of candidates M on TriviaQA with Gemma 3 4B base. Each panel reports one estimator (Dissimilarity, Eccentricity, EigVecDissimilarity). Curves compare multinomial sampling and beam search (with probability weights from equation (4)). All experiments use M = 10 candidates for both multinomial sampling and beam search.

**Figure 5.** PRR (↑ is better) for Dissimilarity under beam search (with probability weights) vs. multinomial sampling, for different output lengths. Each dataset (TriviaQA, CoQA) with Gemma 3 4B base is partitioned into five approximately equal-size bins by token length of the greedy output.

**Figure 6.** Prediction-Rejection curves for Dissimilarity, Eccentricity, and EigVecDissimilarity on TriviaQA with Llama 3.1 8B base, comparing multinomial sampling (blue) and beam search with weights (orange). Oracle (black) and random (gray dashed) baselines are shown. The vertical dashed line marks the maximum rejection rate used in AUC calculations.

**Figure 7.** PRR (↑ is better) as a function of the number of candidates M on TriviaQA with Gemma 3 4B base for 3 UQ methods: Semantic Entropy, and sampling and beam search versions of Dissimilarity.

**Figure 8.** PRR (↑ is better) as a function of the number of candidates M across different datasets and models.

**Figure 9.** Left: average probability mass covered by the candidate set (M = 10) across output-length bins (averaged over examples in the bin) on TriviaQA with Gemma 3 4B base. Right: for beam search, distribution of sequence probabilities p(b^(i) | x) by beam rank i (1 = highest-probability text).

**Figure 10.** Two examples from Gemma 3 4B base on TriviaQA. Each panel shows the question, …

**Figure 11.** Two examples from Gemma 3 4B base on WebQ. Each panel shows the question, greedy …

**Figure 12.** One example from Gemma 3 4B base on CoQA. Shown are the question, greedy an…

Original abstract

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Beam search replaces multinomial sampling in consistency UQ and delivers steadier estimates plus gains on six QA sets, but the supporting bound looks fragile in the peaked regime the paper targets.

The main takeaway is that swapping beam search for standard sampling inside consistency-based uncertainty quantification cuts run-to-run variance and lifts performance on short-form QA. The authors also supply a lower bound on the probability mass captured by the beam set that is supposed to guarantee lower error than multinomial draws of the same size. That combination is the concrete novelty here; prior consistency work stayed with stochastic sampling and did not derive this kind of coverage guarantee.

On the empirical side the paper reports consistent improvements across six datasets and claims state-of-the-art UQ numbers, which is useful incremental evidence even if the absolute margins are modest. The reduced variance is the clearest practical win, since practitioners care about stable uncertainty scores across repeated runs.

The bound itself is the part that needs scrutiny. It only kicks in once the beam set exceeds a certain mass threshold, and in the highly peaked token distributions typical of short QA (low temperature, top-p filtering) that threshold can be high enough that a modest beam width often falls short. The manuscript does not appear to report per-query beam-mass statistics, so it is hard to tell how often the theoretical condition actually holds versus how much of the observed gain comes from simply avoiding duplicate generations. If the bound is met only sporadically, the superiority claim rests more on the empirical pattern than on the proof.

The experiments look standard for the area, but without the full protocol details on beam widths, temperature settings, and exact consistency metrics it is difficult to judge reproducibility. Overall the work is a straightforward methodological tweak with a supporting derivation rather than a paradigm shift. Readers who already run consistency UQ on QA tasks will find the beam-search variant worth trying, especially if they value lower variance.

The paper is coherent on its own terms and engages the relevant literature, so it clears the bar for a serious referee. I would send it out for review with the expectation that the authors supply the missing per-query mass numbers and tighten the discussion of when the bound applies.

Referee Report

2 major / 2 minor

Summary. The paper proposes a family of consistency-based uncertainty quantification methods for LLMs that replace multinomial sampling with beam search to generate candidate sets. It claims improved performance and lower variance on six short-form QA datasets, supported by a theoretical lower bound B on the total probability mass of the beam set such that, when exceeded, beam search provably yields lower consistency error than multinomial sampling with the same number of draws.

Significance. If the bound is met in practice and the gains are attributable to the theoretical mechanism rather than secondary effects such as reduced duplicates, the work could offer a practical, lower-variance alternative for UQ in peaked LLM distributions, strengthening consistency-based methods without additional training.

major comments (2)

[§4] §4 (Theoretical Bound), Eq. (bound): the derivation supplies a lower bound B on beam-set mass guaranteeing smaller error than multinomial sampling, yet the manuscript reports no per-query beam-mass statistics or fraction of queries where mass exceeds B. In the peaked regime typical of short-form QA (top-5 mass often >0.9), B can be large enough that a beam of width 10 fails to exceed it on a non-negligible fraction of examples; without these diagnostics the empirical gains cannot be confidently attributed to the bound rather than reduced duplicates or lower run-to-run variance.
[§5.2] §5.2 (Experiments), Table 2: consistent improvements are shown across six datasets, but exact beam widths, temperature, and top-p values are not tabulated per dataset, and no ablation isolates the contribution of the bound versus secondary effects (e.g., deterministic coverage). This leaves the central claim that beam search is superior precisely when the bound holds unverified.

minor comments (2)

[§3] §3 (Method): the precise definition of the consistency metric (e.g., exact match vs. semantic equivalence) and how ties are broken in beam search should be stated explicitly for reproducibility.
[Figure 1] Figure 1: axis labels and legend entries are too small; increasing font size would improve readability of the variance comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the link between our theoretical bound and empirical results. We address each major point below and will revise the manuscript to include the requested diagnostics, hyperparameter details, and ablation analysis.

Referee: [§4] §4 (Theoretical Bound), Eq. (bound): the derivation supplies a lower bound B on beam-set mass guaranteeing smaller error than multinomial sampling, yet the manuscript reports no per-query beam-mass statistics or fraction of queries where mass exceeds B. In the peaked regime typical of short-form QA (top-5 mass often >0.9), B can be large enough that a beam of width 10 fails to exceed it on a non-negligible fraction of examples; without these diagnostics the empirical gains cannot be confidently attributed to the bound rather than reduced duplicates or lower run-to-run variance.

Authors: We agree that these statistics are needed to attribute gains to the bound rather than secondary effects. In the revision we will add per-query beam-set probability mass distributions (as histograms or tables) for all six datasets and explicitly report the fraction of queries where the mass exceeds B. This will show how often the theoretical condition holds under the peaked distributions typical of short-form QA.
revision: yes
Referee: [§5.2] §5.2 (Experiments), Table 2: consistent improvements are shown across six datasets, but exact beam widths, temperature, and top-p values are not tabulated per dataset, and no ablation isolates the contribution of the bound versus secondary effects (e.g., deterministic coverage). This leaves the central claim that beam search is superior precisely when the bound holds unverified.

Authors: We will add a supplementary table listing the exact beam width, temperature, and top-p values used for every dataset and method. We will also include a new ablation that matches duplicate rates between beam search and multinomial sampling (via post-hoc deduplication or deterministic variants) to isolate the contribution of the probability-mass bound from reduced stochasticity.
revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical bound derived from probability mass comparison; empirical results independent

The paper derives a lower bound B on beam-set probability mass such that beam search yields lower consistency error than multinomial sampling whenever the bound is met. This comparison is stated directly in terms of collision probabilities and top-k coverage without any fitted parameters defined from the same data or self-citation chains. The empirical evaluation uses standard QA benchmarks with no reported reduction of predictions to inputs by construction. No self-definitional, fitted-input, or ansatz-smuggling steps appear in the provided derivation outline.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard LLM decoding assumptions plus one domain-specific premise about peaked output distributions in short QA.

free parameters (1)

beam width
Hyper-parameter controlling how many candidates beam search retains; value not specified in abstract.

axioms (1)

domain assumption: Beam search maintains a set whose total probability mass exceeds a computable lower bound in peaked distributions.
Invoked to guarantee smaller error than multinomial sampling.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (Comparison condition for beam-weighted and Monte Carlo estimators). ... m_B > 1 - 1/(2√M)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
