
arxiv: 2512.09538 · v2 · submitted 2025-12-10 · 📊 stat.ML · cs.CL · cs.LG

Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Pith reviewed 2026-05-16 23:47 UTC · model grok-4.3

classification 📊 stat.ML · cs.CL · cs.LG
keywords uncertainty quantification · large language models · beam search · consistency-based methods · question answering · multinomial sampling · probability bounds

The pith

Beam search generates better candidate sets than multinomial sampling for consistency-based uncertainty quantification in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Consistency-based uncertainty methods for large language models measure agreement across multiple generated answers. Under the peaked output distributions typical of short-form QA, multinomial sampling often returns duplicate answers, and its randomness makes the uncertainty estimate vary noticeably from run to run. The paper replaces sampling with beam search to produce the candidate set and reports better UQ performance and lower variance on six QA datasets. It also derives a theoretical lower bound on the total probability mass of the beam set; when the beam set's mass exceeds that bound, beam search provably achieves smaller error than sampling. The result is a new family of consistency-based UQ methods that the authors report reach state-of-the-art performance.
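The duplicate problem is easy to see on a toy example. The sketch below is illustrative only: the answer distribution `toy_probs` and the helper names are hypothetical and not taken from the paper. It draws 10 multinomial samples from a peaked distribution and reports how many draws repeat an answer already seen.

```python
import random
from collections import Counter

def sample_answers(probs, m, rng):
    """Draw m answers i.i.d. from a categorical distribution over answer strings."""
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=m)

def redundancy(samples):
    """Fraction of draws that repeat an answer already seen earlier in the list."""
    return 1.0 - len(set(samples)) / len(samples)

# Hypothetical peaked answer distribution for one short-form question.
toy_probs = {"Paris": 0.80, "Lyon": 0.08, "Marseille": 0.05,
             "Nice": 0.04, "Toulouse": 0.03}

rng = random.Random(0)
samples = sample_answers(toy_probs, m=10, rng=rng)
print(Counter(samples))
print(f"redundant fraction: {redundancy(samples):.2f}")

# An exact top-M search over this toy distribution returns each answer once.
top_m = sorted(toy_probs, key=toy_probs.get, reverse=True)[:10]
print(f"distinct answers covered deterministically: {len(top_m)}")
```

With only five possible answers, at least half of the 10 draws must be repeats, whichever seed is used; a deterministic search spends no candidates on repeats.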

Core claim

Beam search produces answer candidates whose agreement yields lower-error and lower-variance uncertainty estimates than multinomial sampling whenever the beam set captures a sufficient fraction of the total probability mass; a closed-form lower bound on that mass is given under which the error advantage is guaranteed.
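The excerpt attached to Figure 3 on this page gives the sufficient condition from Theorem 1 in terms of the total beam probability mass $m_B$ and the number of candidates $M$. Written out, with the $M = 10$ value quoted there (the $M = 5$ and $M = 20$ values are plain arithmetic on the same formula, not figures from the paper):

```latex
m_B \;>\; 1 - \frac{1}{2\sqrt{M}}
\qquad\Longrightarrow\qquad
\begin{aligned}
M = 5:  &\quad 1 - \tfrac{1}{2\sqrt{5}}  \approx 0.776 \\
M = 10: &\quad 1 - \tfrac{1}{2\sqrt{10}} \approx 0.842 \\
M = 20: &\quad 1 - \tfrac{1}{2\sqrt{20}} \approx 0.888
\end{aligned}
```

The threshold rises with $M$: the larger the sampling budget being compared against, the more probability mass the beam set must cover for the guarantee to apply.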

What carries the argument

Beam search used to generate a fixed-size set of high-probability answer candidates for measuring consistency, together with the derived probability-mass lower bound that determines when it outperforms multinomial sampling.
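For readers who want the mechanics, here is a minimal beam search over a toy next-token model. It is a generic textbook sketch, not the authors' implementation: `toy_model`, its probabilities, and the choice to discard hypotheses that have not finished by `max_len` are all assumptions made for illustration.

```python
import math

def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Return up to beam_width finished sequences as (tokens, log_prob) pairs.

    next_token_probs(prefix) is a hypothetical callable returning a dict
    {token: probability} for the next token given a tuple prefix.
    Hypotheses still unfinished after max_len steps are discarded.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for tok, p in next_token_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (tok,), logp + math.log(p)))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, logp))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

def toy_model(prefix):
    """Hypothetical two-step answer distribution."""
    if prefix == ():
        return {"Paris": 0.8, "Lyon": 0.15, "Nice": 0.05}
    if len(prefix) == 1:
        return {"<eos>": 0.9, "!": 0.1}
    return {"<eos>": 1.0}

results = beam_search(toy_model, beam_width=3, max_len=3)
mass = sum(math.exp(lp) for _, lp in results)
for seq, lp in results:
    print(seq, round(math.exp(lp), 3))
print("beam-set probability mass:", round(mass, 3))
```

Each returned candidate is distinct and carries its own sequence probability, so the total mass of the set can be read off directly, which is the quantity the bound is stated in.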

If this is right

  • Uncertainty estimates become more stable across independent runs without increasing the number of generations.
  • State-of-the-art UQ performance is reached on six standard short-form QA benchmarks.
  • The theoretical bound supplies a practical diagnostic for when beam search is guaranteed to improve over sampling.
  • Consistency-based methods can be applied with less sensitivity to the choice of decoding temperature or top-p.
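The diagnostic in the third bullet can be sketched in a few lines. The threshold formula is the one quoted in the Figure 3 excerpt; everything else is an assumption of this sketch, in particular the function names and the premise that each beam candidate comes with an unnormalized full-sequence log-probability.

```python
import math

def mass_threshold(m):
    """Threshold 1 - 1/(2*sqrt(M)) as quoted in the Figure 3 excerpt."""
    return 1.0 - 1.0 / (2.0 * math.sqrt(m))

def beam_mass(seq_logprobs):
    """Total probability mass of a beam set from per-sequence log-probabilities."""
    return sum(math.exp(lp) for lp in seq_logprobs)

def fraction_meeting_condition(per_query_logprobs):
    """Share of queries whose beam mass exceeds the threshold for their beam size."""
    hits = sum(
        beam_mass(lps) > mass_threshold(len(lps)) for lps in per_query_logprobs
    )
    return hits / len(per_query_logprobs)

# Hypothetical log-probabilities for two queries with M = 3 candidates each.
per_query = [
    [math.log(0.72), math.log(0.135), math.log(0.08)],  # mass 0.935
    [math.log(0.30), math.log(0.20), math.log(0.10)],   # mass 0.600
]
print(f"threshold for M=10: {mass_threshold(10):.3f}")
print(f"fraction meeting the condition: {fraction_meeting_condition(per_query):.2f}")
```

Length-normalized beam scores would not work here, since the condition is stated on probability mass; the sketch therefore assumes raw sequence log-probabilities.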

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same beam-search candidate generation could be paired with other agreement-based or entropy-based UQ techniques beyond the consistency methods tested here.
  • If the probability-mass bound is routinely satisfied, beam search might allow fewer total generations while preserving or improving UQ quality.
  • The approach may transfer to longer-form generation tasks where duplicate answers are less common but variance remains an issue.

Load-bearing premise

Beam search will reliably produce a candidate set whose total probability mass exceeds the derived lower bound in the peaked short-form QA regime.

What would settle it

Empirical measurement on a new QA dataset showing that typical beam sets fall below the probability-mass bound or that the claimed reduction in variance across repeated runs disappears.
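A check of this kind can be prototyped before touching a real model. The simulation below is a toy stand-in, not the paper's estimators: the answer distribution, the 0/1 dissimilarity, and both estimator functions are hypothetical. It compares the run-to-run spread of a sampling-based score with a probability-weighted score over the most likely answers, which is the same on every run.

```python
import random
import statistics

# Hypothetical answer distribution for one question.
probs = {"Paris": 0.80, "Lyon": 0.08, "Marseille": 0.05,
         "Nice": 0.04, "Toulouse": 0.03}
greedy = max(probs, key=probs.get)

def dissim(a, b):
    """Toy 0/1 dissimilarity; a real pipeline would use an NLI or similarity model."""
    return 0.0 if a == b else 1.0

def mc_estimate(m, rng):
    """Mean dissimilarity to the greedy answer over m multinomial samples."""
    draws = rng.choices(list(probs), weights=list(probs.values()), k=m)
    return sum(dissim(y, greedy) for y in draws) / m

def weighted_estimate(k):
    """Probability-weighted dissimilarity over the k most likely answers."""
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    return sum(probs[y] * dissim(y, greedy) for y in top)

truth = sum(p * dissim(y, greedy) for y, p in probs.items())
rng = random.Random(0)
runs = [mc_estimate(10, rng) for _ in range(1000)]

print(f"true value            : {truth:.3f}")
print(f"sampling mean / stdev : {statistics.mean(runs):.3f} / {statistics.stdev(runs):.3f}")
print(f"top-3 weighted        : {weighted_estimate(3):.3f} (same on every run)")
```

The weighted score is biased by whatever mass the top answers miss, so the comparison only favors it when that missing mass is small relative to the sampling noise. Falsifying the paper's claim would amount to finding a dataset where that trade goes the other way.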

Figures

Figures reproduced from arXiv: 2512.09538 by Aleksandr Rubashevskii, Artem Shelmanov, Ekaterina Fadeeva, Maiya Goloburda, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Roman Vashurin.

Figure 1. Beam Search vs Multinomial Sampling. Sampling produces multiple identical generations resulting in noisy confidence estimate, while beam search covers top answers from LLM distribution resulting in a better confidence score.
Figure 2. Mean percentage of redundant samples (i.e., outputs already seen among earlier generations) as a function of greedy output length. Results were obtained from 2,000 questions from the TriviaQA dataset using the Gemma 3 4B base model and 10 candidate generations. Redundancy is especially high for short answers, leading to wasted computation.
Figure 3. Percentage of texts meeting the sufficient condition (Theorem 1). Results are based on 2,000 TriviaQA questions, Gemma 3 4B base and M = 10. The green “All” bar shows the overall percentage across all lengths. From Theorem 1, beam-weighted estimator is more accurate than Monte Carlo estimator whenever total beam probability mass $m_B$ exceeds $1 - \frac{1}{2\sqrt{M}}$. For $M = 10$, the threshold is $m_B > 0.842$.
Figure 4. PRR (↑ is better) as a function of the number of candidates M on TriviaQA with Gemma 3 4B base. Each panel reports one estimator (Dissimilarity, Eccentricity, EigVecDissimilarity). Curves compare multinomial sampling and beam search (with probability weights from equation (4)). All experiments use M = 10 candidates for both multinomial sampling and beam search.
Figure 5. PRR (↑ is better) for Dissimilarity under beam search (with probability weights) vs. multinomial sampling, for different output lengths. Each dataset (TriviaQA, CoQA) with Gemma 3 4B base is partitioned into five approximately equal-size bins by token length of greedy output.
Figure 6. Prediction-Rejection curves for Dissimilarity, Eccentricity, and EigVecDissimilarity on TriviaQA with Llama 3.1 8B base, comparing multinomial sampling (blue) and beam search with weights (orange). Oracle (black) and random (gray dashed) baselines are shown. The vertical dashed line marks the maximum rejection rate used in AUC calculations.
Figure 7. PRR (↑ is better) as a function of the number of candidates M on TriviaQA with Gemma 3 4B base for 3 UQ methods: Semantic Entropy, and sampling and beam search versions of Dissimilarity.
Figure 8. PRR (↑ is better) as a function of the number of candidates M across different datasets and models.
Figure 9. Left: average probability mass covered by the candidate set (M = 10) across output-length bins (averaged over examples in the bin) on TriviaQA with Gemma 3 4B base. Right: for beam search, distribution of sequence probabilities $p(b^{(i)} \mid x)$ by beam rank $i$ (1 = highest-probability text).
Figure 10. Two examples from Gemma 3 4B base on TriviaQA. Each panel shows the question, …
Figure 11. Two examples from Gemma 3 4B base on WebQ. Each panel shows the question, greedy …
Figure 12. One example from Gemma 3 4B base on CoQA. Shown are the question, greedy an…
Original abstract

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a family of consistency-based uncertainty quantification methods for LLMs that replace multinomial sampling with beam search to generate candidate sets. It claims improved performance and lower variance on six short-form QA datasets, supported by a theoretical lower bound B on the total probability mass of the beam set such that, when exceeded, beam search provably yields lower consistency error than multinomial sampling with the same number of draws.

Significance. If the bound is met in practice and the gains are attributable to the theoretical mechanism rather than secondary effects such as reduced duplicates, the work could offer a practical, lower-variance alternative for UQ in peaked LLM distributions, strengthening consistency-based methods without additional training.

major comments (2)
  1. [§4] §4 (Theoretical Bound), Eq. (bound): the derivation supplies a lower bound B on beam-set mass guaranteeing smaller error than multinomial sampling, yet the manuscript reports no per-query beam-mass statistics or fraction of queries where mass exceeds B. In the peaked regime typical of short-form QA (top-5 mass often >0.9), B can be large enough that a beam of width 10 fails to exceed it on a non-negligible fraction of examples; without these diagnostics the empirical gains cannot be confidently attributed to the bound rather than reduced duplicates or lower run-to-run variance.
  2. [§5.2] §5.2 (Experiments), Table 2: consistent improvements are shown across six datasets, but exact beam widths, temperature, and top-p values are not tabulated per dataset, and no ablation isolates the contribution of the bound versus secondary effects (e.g., deterministic coverage). This leaves the central claim that beam search is superior precisely when the bound holds unverified.
minor comments (2)
  1. [§3] §3 (Method): the precise definition of the consistency metric (e.g., exact match vs. semantic equivalence) and how ties are broken in beam search should be stated explicitly for reproducibility.
  2. [Figure 1] Figure 1: axis labels and legend entries are too small; increasing font size would improve readability of the variance comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the link between our theoretical bound and empirical results. We address each major point below and will revise the manuscript to include the requested diagnostics, hyperparameter details, and ablation analysis.

Point-by-point responses
  1. Referee: [§4] §4 (Theoretical Bound), Eq. (bound): the derivation supplies a lower bound B on beam-set mass guaranteeing smaller error than multinomial sampling, yet the manuscript reports no per-query beam-mass statistics or fraction of queries where mass exceeds B. In the peaked regime typical of short-form QA (top-5 mass often >0.9), B can be large enough that a beam of width 10 fails to exceed it on a non-negligible fraction of examples; without these diagnostics the empirical gains cannot be confidently attributed to the bound rather than reduced duplicates or lower run-to-run variance.

    Authors: We agree that these statistics are needed to attribute gains to the bound rather than secondary effects. In the revision we will add per-query beam-set probability mass distributions (as histograms or tables) for all six datasets and explicitly report the fraction of queries where the mass exceeds B. This will show how often the theoretical condition holds under the peaked distributions typical of short-form QA.

    Revision: yes

  2. Referee: [§5.2] §5.2 (Experiments), Table 2: consistent improvements are shown across six datasets, but exact beam widths, temperature, and top-p values are not tabulated per dataset, and no ablation isolates the contribution of the bound versus secondary effects (e.g., deterministic coverage). This leaves the central claim that beam search is superior precisely when the bound holds unverified.

    Authors: We will add a supplementary table listing the exact beam width, temperature, and top-p values used for every dataset and method. We will also include a new ablation that matches duplicate rates between beam search and multinomial sampling (via post-hoc deduplication or deterministic variants) to isolate the contribution of the probability-mass bound from reduced stochasticity.

    Revision: yes
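The post-hoc deduplication mentioned in the second response could look roughly like the sketch below. It is one plausible reading, not the authors' procedure: the `normalize` rule (lowercasing, stripping punctuation and extra whitespace) is a hypothetical choice, and a semantic-equivalence check could replace it.

```python
import string

def normalize(answer):
    """Hypothetical normalization: lowercase, strip punctuation and extra spaces."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())

def deduplicate(samples):
    """Keep the first occurrence of each normalized answer, preserving order."""
    seen, unique = set(), []
    for s in samples:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

samples = ["Paris", "paris.", "Lyon", "Paris ", "Nice"]
print(deduplicate(samples))  # ['Paris', 'Lyon', 'Nice']
```

Deduplicated samples no longer form an i.i.d. draw, so any estimator applied afterwards needs its weighting reconsidered; that is part of what such an ablation would have to spell out.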

Circularity Check

0 steps flagged

No circularity: theoretical bound derived from probability mass comparison; empirical results independent

Full rationale

The paper derives a lower bound B on beam-set probability mass such that beam search yields lower consistency error than multinomial sampling whenever the bound is met. This comparison is stated directly in terms of collision probabilities and top-k coverage without any fitted parameters defined from the same data or self-citation chains. The empirical evaluation uses standard QA benchmarks with no reported reduction of predictions to inputs by construction. No self-definitional, fitted-input, or ansatz-smuggling steps appear in the provided derivation outline.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard LLM decoding assumptions plus one domain-specific premise about peaked output distributions in short QA.

free parameters (1)
  • beam width
    Hyper-parameter controlling how many candidates beam search retains; value not specified in abstract.
axioms (1)
  • domain assumption: Beam search maintains a set whose total probability mass exceeds a computable lower bound in peaked distributions
    Invoked to guarantee smaller error than multinomial sampling.



