
arxiv: 2606.27359 · v1 · pith:DEMLLILRnew · submitted 2026-06-25 · 📊 stat.ML · cs.LG

When are likely answers right? On Sequence Probability and Correctness in LLMs

Pith reviewed 2026-06-26 01:53 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords sequence probability · LLM decoding · correctness · accuracy · self-consistency · verifier-free self-improvement · decoding methods

The pith

Sequence probability predicts correctness across prompt-answer pairs in a dataset but not when changing decoding methods or within responses to one prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper quantifies how well the conditional probability of a full answer given a prompt aligns with whether that answer is correct. It tests this alignment across four levels: different decoding methods, different hyperparameters inside one method, different prompt-answer pairs inside one dataset, and multiple answers generated for the identical prompt. A positive link appears when comparing answers across a fixed dataset, yet the link disappears when probability is raised by switching methods or settings, and it also fails to separate correct from incorrect answers to any single prompt. The results therefore limit how much correctness can be expected to rise simply by steering generation toward higher-probability sequences.
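The gap between the dataset-level and the prompt-level comparison can be illustrated with a small synthetic example. The sketch below is illustrative only: the data are made up, and `make_toy_data`, `pearson`, `within_dataset_corr`, and `within_prompt_corrs` are hypothetical helpers, not code from the paper. It assumes that prompt difficulty shifts both log-probability and accuracy, while correctness inside a single prompt is drawn independently of log-probability.

```python
import math
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def make_toy_data(n_prompts=200, n_samples=16, seed=0):
    """Synthetic (prompt_id, log_prob, correct) triples; NOT data from the paper."""
    rng = random.Random(seed)
    rows = []
    for pid in range(n_prompts):
        ease = rng.random()  # per-prompt "easiness" in [0, 1)
        for _ in range(n_samples):
            # Easier prompts get higher log-probability on average ...
            log_prob = -50.0 * (1.0 - ease) + rng.gauss(0.0, 2.0)
            # ... and are answered correctly more often, but within a prompt
            # correctness is independent of log_prob (assumption of this toy).
            correct = 1.0 if rng.random() < ease else 0.0
            rows.append((pid, log_prob, correct))
    return rows


def within_dataset_corr(rows):
    """Correlation pooled over all prompt-answer pairs."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])


def within_prompt_corrs(rows):
    """One correlation per prompt; prompts with constant correctness are skipped."""
    by_prompt = {}
    for pid, log_prob, correct in rows:
        by_prompt.setdefault(pid, []).append((log_prob, correct))
    corrs = []
    for samples in by_prompt.values():
        labels = [c for _, c in samples]
        if len(set(labels)) < 2:
            continue  # correlation is undefined when all answers agree
        corrs.append(pearson([lp for lp, _ in samples], labels))
    return corrs


rows = make_toy_data()
pooled = within_dataset_corr(rows)
per_prompt = within_prompt_corrs(rows)
print(f"within-dataset r = {pooled:.2f}")
print(f"mean within-prompt r = {mean(per_prompt):.2f} over {len(per_prompt)} prompts")
```

In this toy setting the pooled correlation is clearly positive while the per-prompt correlations scatter around zero, which mirrors the qualitative pattern the summary describes without reproducing any of the paper's numbers.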

Core claim

The paper establishes that sequence probability aligns with correctness when comparing different prompt-answer pairs inside one dataset, yet this alignment does not extend to choices made by varying decoding hyperparameters or methods, nor does it hold for selecting among responses to a single prompt.

What carries the argument

Sequence probability (the model's conditional probability of an entire continuation given the prompt) and its measured correlation with correctness at four distinct levels of comparison.
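For reference, sequence probability is commonly written through the autoregressive factorization. The notation below ($x$ for the prompt, $y = (y_1, \dots, y_T)$ for the continuation, $p_\theta$ for the model) is an assumption made for illustration and is not taken from the paper. The figure captions suggest that both a summed and a per-token (averaged) log-probability are reported.

```latex
\log p_\theta(y \mid x) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right),
\qquad
\underbrace{\frac{1}{T} \log p_\theta(y \mid x)}_{\text{per-token average}}
```

The summed form penalizes longer continuations, whereas the per-token form removes that length dependence; which of the two is meant matters when comparing answers of different length.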

If this is right

  • Decoding methods that increase sequence probability cannot be assumed to raise accuracy.
  • Hyperparameter changes that raise sequence probability do not reliably improve performance.
  • Sequence probability cannot be used to pick the correct answer among multiple generations for the same prompt.
  • Self-consistency and verifier-free self-improvement techniques that rely on probability need to account for these limits.
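The last two points can be made concrete with a minimal sketch of answer selection among several generations for one prompt: uniform majority voting (self-consistency) versus probability-weighted voting. The function names and toy numbers are hypothetical, and the weighting used here (votes proportional to sequence probability) is an assumed scheme; the paper's exact weighting, including its "power" variant, may differ.

```python
import math
from collections import defaultdict


def majority_vote(answers):
    """Uniform self-consistency: return the most frequent final answer."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def probability_weighted_vote(answers, log_probs):
    """Weight each sample's vote by its sequence probability (assumed scheme)."""
    top = max(log_probs)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_probs]
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)


# Toy example: "42" is sampled three times; "41" is sampled once but with a
# much higher sequence log-probability.
answers = ["42", "42", "42", "41"]
log_probs = [-12.0, -11.5, -12.3, -8.0]

uniform_choice = majority_vote(answers)
weighted_choice = probability_weighted_vote(answers, log_probs)
print(uniform_choice, weighted_choice)
```

Here a single high-probability outlier overturns the majority. If probability does not separate correct from incorrect answers to the same prompt, as the summary reports, this kind of override is as likely to hurt as to help.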

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that directly encourage probability to track correctness may be needed rather than relying on post-training decoding adjustments.
  • Signals other than sequence probability, such as cross-sample consistency, could be combined with or replace probability for answer selection.
  • Dataset-specific thresholds on probability might still allow rough correctness prediction even if they do not guide decoding.

Load-bearing premise

The models, benchmarks, and decoding methods tested are representative enough that the observed probability-correctness patterns apply beyond the specific cases examined.

What would settle it

A controlled experiment in which a new decoding method or hyperparameter setting raises sequence probability on multiple benchmarks and also raises accuracy would falsify the claim that such increases do not reliably improve correctness.

Figures

Figures reproduced from arXiv: 2606.27359 by Johannes Zenn, Jonas Geiping.

Figure 1: The correlation between log-probability and correctness across three levels. Left: Across-method correlation: log-probability plotted against accuracy for various decoding methods and hyperparameters (connected by lines). Within-method correlation: plotted without transparency. Middle: Within-dataset correlation: for one method (BoN with N = 32) we bin correct (green dots, top) and incorrect data (red dots…
Figure 2: Within-dataset correlation: log-probability and correctness are correlated for each dataset. Each panel scatters samples using log-probability and correctness. ρ: Spearman correlation coefficient. r: (binned) Pearson correlation coefficient. Correlation is strongest for MATH500 and smaller but positive for GPQA, HumanEval, MedQA, and MMLU. Only IFEval shows a negative correlation. Plots show Qwen3-8B-Base …
Figure 3: Within-dataset correlation: log-probability and correctness are consistently correlated across model families and datasets, largely independent of methods. Base models show more negative and diverse correlation and are consistently negative for IFEval. Posttrained models show consistently positive correlation. Correlation coefficients r are averaged over model sizes at a canonical hyperparameter for each me…
Figure 4: Within-method correlation: while sequences are achieving higher log-probability, one cannot predict correctness by changing a hyperparameter of the method. Correlations within methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets …
Figure 5: Across-method correlation: many methods produce samples that are both more probable and more accurate than the low-temperature sampling (LTS) baseline. However, correlations are not consistent across models and datasets. Correlations across methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and ben…
Figure 6: Within-dataset correlation: the correlation between log-probability and correctness increases with accuracy. Probability-based self-improvement loops would require the model to have sufficient accuracy. Accuracy is plotted against correlation across all datasets, methods, and models (left: base, right: posttrained, at canonical hyperparameter). We observe positive correlation (base: r = 0.66, posttrained: …
Figure 7: Color coding for …
Figure 8: Within-sample correlation: correlation coefficients are distributed symmetrically around zero (left, middle). Further (right), the more correct a set of continuations for a single prompt is, the more positive is its correlation. Per-sample rank correlation coefficient of Qwen3 base models (left) and posttrained models (middle) is distributed symmetrically with mean zero. Right: Correlation coefficient and …
Figure 9: Color coding for …
Figure 10: Power self-consistency improves over self-consistency on MATH500. On GPQA and MMLU power self-consistency degrades performance. Plot shows data for the Qwen3 model series across various numbers of samples k. See Section 3.4 for a discussion. … thought to be more calibrated than posttrained models, which does not seem to hold true here. On a different note, one can pose that for probability-based verifier-free…
Figure 11: Self-consistency with probability-weighted voting often underperforms majority voting. Power self-consistency does not show consistent differences. Accuracy difference between (power) self-consistency using uniform or probability-weighted majority voting. Results on Qwen3 model series across three datasets. See Section 3.4 for details. … weighting mostly degrades performance. This can be explained by Findi…
Figure 12: We observe consistent within-dataset correlation across model families and datasets largely independent of methods. Base models show more negative and diverse correlation and are consistently negative for IFEval. Posttrained models show consistently positive correlation. Correlation coefficients ρ averaged over model sizes at representative hyperparameter for each method. Methods are in the legend. Datase…
Figure 13: (caption not recovered)
Figure 14: Methods show either positive or negative correlation within their hyperparameter. Correlations within methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. Detailed discussion in Section 3.2.
Figure 15: (caption not recovered)
Figure 16: Various datasets and methods show positive correlations between sequence probability and correctness. Correlations across methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. See Section 3.3.
Figure 17: Within-sample correlation coefficients are generally small with MATH500 consistently showing much larger values. Datasets are plotted horizontally on the bottom: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU. In this section, we extend our analysis of Section 3.4 …
Figure 18: The more correct a sample is under repeated sampling, the stronger its correlation coefficient. We observe similar results across all model families (columns) and variants (rows).
Figure 19: Correlation coefficients are mostly symmetrically distributed around zero; the more correct a sample is, the more positive its correlation. Per-sample rank correlation coefficient of Qwen3 base models (left) and posttrained models (middle) is distributed symmetrically with mean zero. Correlation is computed within a sample across methods. Right: Correlation coefficient and fraction of correct samples seems…
Figure 20: Correlation coefficients are mostly symmetrically distributed around zero; the more correct a sample is, the more positive its correlation. Per-sample rank correlation coefficient of Qwen3 base models (left) and posttrained models (middle) is distributed symmetrically with mean zero. Correlation is computed within a sample across methods. Right: Correlation coefficient and fraction of correct samples seems…
Figure 21: Within-sample correlation coefficients are generally small with MATH500 consistently showing much larger values. Datasets are plotted horizontally on the bottom: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU. We show our results in …
Figure 22: (caption not recovered)
Figure 23: The more correct a sample is under repeated sampling, the stronger its correlation coefficient. We observe similar results across all model families (columns) and variants (rows). Same as …
Figure 24: Beam search often achieves smallest log-probabilities; BoN, SPS, and power-SMC have larger log-probabilities. Log-probabilities averaged over model sizes. Typically, BoN, SPS, and power-SMC lead to smallest log-probabilities. LTS is typically larger and beam search is largest. Olmo3 seems to be an outlier for power-SMC where log-probabilities appear very small. Datasets are plotted horizontally on the bot…
Figure 25: Global decoding methods produce sequences of larger per-token log-probability that are smoother in posttrained models. For base models, LTS samples are shortest (top row), SPS and BoN samples appear longer. There is no meaningful difference between log-probability and smoothness. For posttrained models, we observe larger per-token log-probability and smoother trajectories for global decoding methods. Qwen…
Figure 26: Global decoding methods produce sequences of larger per-token log-probability that are smoother in posttrained models. For base models, LTS samples are shortest (top row), SPS and BoN samples appear longer. There is no meaningful difference between log-probability and smoothness. For posttrained models, we observe larger per-token log-probability and smoother trajectories for global decoding methods. Qwen…
Figure 27: We find that using per-token log-probability does not change the within-dataset correlation significantly when considering Spearman ρ. For Pearson r, many points get pushed to the corner; the off-diagonal elements seem to be consistent for ρ and r. [Plot residue omitted. Axis labels: Spearman ρ (sum), Spearman ρ (average), Pearson r (sum), …; dataset labels: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU.]
Figure 28: Methods show either positive or negative correlation within their hyperparameter using per-token log-probability. Correlations within methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. Within-method correlations. C…
Figure 29: Methods show either positive or negative correlation within their hyperparameter using per-token log-probability. Correlations within methods for local and global decoding methods, models (Qwen2.5 series 8B as ∗1, Math 8B as ∗2 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. [Plot residue omitted: grid of +/- correlation-sign pairs for panels B1 and P1.]
Figure 30: Methods show either positive or negative correlation within their hyperparameter using per-token log-probability. Correlations within methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.
Figure 31: Various datasets and methods show positive correlations between sequence probability and correctness using per-token log-probability. Correlations across methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. [Plot residue omitted: panel labels B1, B2, P1, …]
Figure 32: Various datasets and methods show positive correlations between sequence probability and correctness using per-token log-probability. Correlations across methods for local and global decoding methods, models (Qwen2.5 series 8B as ∗1, Math 8B as ∗2 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.
Figure 33: Various datasets and methods show positive correlations between sequence probability and correctness using per-token log-probability. Correlations across methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.
Figure 34: (caption not recovered)
Figure 35: Sequence lengths overlap across methods; beam search is the only exception. Data for Qwen3-8B. Details in Appendix E.4.
Figure 36: Left: Comparing mode and power-distribution we find no consistent differences in accuracy. Middle, right: BoN typically leads to shorter samples that are of smaller probability. Datasets are plotted horizontally on the bottom of each panel: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU. Qwen3 model family. Comparing correctness across BoN and power-distribution in …
Figure 37: Self-distillation shows diverse results across methods. On MATH500 power self-distillation seems to often improve results. Otherwise, results are mixed. Qwen3 base models. In …
Figure 38: Left: Qwen3 thinking mostly improves on HumanEval, MMLU, and MedQA; the strongest decline can be observed on IFEval. Across methods, beam search often performs worst. Right: Thinking models often reach the token limit. The mean sequence length for thinking models is much higher than for non-thinking models. In …
Original abstract

Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript examines the alignment between sequence probability (conditional probability of a continuation given a prompt) and correctness in LLMs. It quantifies the relationship at four levels—across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt—finding that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset, but that increasing sequence probability via changes to hyperparameters or methods does not reliably improve accuracy, and that sequence probability is not a reliable indicator of correctness for responses to the same prompt. The work aims to clarify when decoding methods can and cannot be expected to improve correctness.

Significance. If the reported patterns hold under broader testing, the multi-level empirical analysis provides useful practical guidance for decoding choices, self-consistency, and verifier-free self-improvement in LLMs by delineating where probability-correctness alignment exists and where it fails to transfer. The four-level breakdown is a strength, as it separates within-dataset correlations from causal effects of decoding interventions.

major comments (2)
  1. [Abstract] The headline claims use qualifiers such as 'often predictive' and 'does not generally transfer.' These rest on the assumption that the tested models, benchmarks, and decoding methods are representative; however, the abstract provides no enumeration of the specific models (scale, families), benchmarks (task types), or methods, making it impossible to assess whether the negative transfer result is an artifact of a narrow experimental slice (e.g., only small models or multiple-choice tasks).
  2. [Abstract and experimental sections] The central empirical claims at the 'across decoding methods' and 'across hyperparameters' levels require evidence that the observed lack of accuracy improvement when sequence probability is increased is not driven by post-hoc benchmark or method selection. Without explicit reporting of the full set of methods, models, and controls (including statistical tests and exclusion criteria), the 'does not generally transfer' conclusion cannot be evaluated for robustness.
minor comments (1)
  1. [Introduction] Clarify the precise definition of 'sequence probability' (e.g., whether it is the product of token probabilities or normalized) with an equation early in the paper to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our experimental scope. We address each major comment below.

  1. Referee: [Abstract] The headline claims use qualifiers such as 'often predictive' and 'does not generally transfer.' These rest on the assumption that the tested models, benchmarks, and decoding methods are representative; however, the abstract provides no enumeration of the specific models (scale, families), benchmarks (task types), or methods, making it impossible to assess whether the negative transfer result is an artifact of a narrow experimental slice (e.g., only small models or multiple-choice tasks).

    Authors: We agree that the abstract should briefly enumerate the experimental scope to support the qualifiers. In revision we will add a concise clause listing the model families and scales, benchmark task types (including both multiple-choice and generative), and decoding methods tested, while staying within the abstract's length limit.

    Revision: yes

  2. Referee: [Abstract and experimental sections] The central empirical claims at the 'across decoding methods' and 'across hyperparameters' levels require evidence that the observed lack of accuracy improvement when sequence probability is increased is not driven by post-hoc benchmark or method selection. Without explicit reporting of the full set of methods, models, and controls (including statistical tests and exclusion criteria), the 'does not generally transfer' conclusion cannot be evaluated for robustness.

    Authors: The body already reports results aggregated over the full set of models, benchmarks, and methods used. To address the concern directly we will add an appendix that lists every configuration considered, states inclusion/exclusion criteria, and reports the statistical tests applied to accuracy differences. This will make the robustness of the 'does not generally transfer' finding transparent without altering the core claims.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observational study


The paper conducts an empirical analysis of sequence probability vs. correctness across decoding methods, models, benchmarks, and prompt levels. It reports observational patterns from experiments without any derivation chain, equations, fitted parameters presented as predictions, or self-citations that reduce claims to inputs by construction. No self-definitional, fitted-input, or uniqueness-theorem patterns appear. The central claims rest on direct measurement rather than reduction to prior fitted quantities or author-specific ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the work rests on standard empirical correlation and comparison methods in machine learning.

axioms (1)
  • [standard math] Standard statistical assumptions underlying correlation and predictive-power measurements hold for the chosen benchmarks and models.
    The reported predictive relationships presuppose that conventional statistical tools apply without hidden biases in the LLM output distributions.

pith-pipeline@v0.9.1-grok · 5719 in / 1269 out tokens · 29193 ms · 2026-06-26T01:53:17.316822+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 14 linked inside Pith

  1. [1]

    Power-smc: Low-latency sequence-level power sampling for training-free llm reasoning

    Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, and Massoud Pedram. Power-smc: Low-latency sequence-level power sampling for training-free llm reasoning. arXiv preprint arXiv:2602.10273, 2026

  2. [2]

    Smoothing algorithms for state--space models

    Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state--space models. Annals of the Institute of Statistical Mathematics, 62: 0 61--89, 2010

  3. [3]

    Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

    Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024 a

  4. [4]

    Evaluating large language models trained on code, 2021

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code, 2021. arXiv preprint arXiv:2107.03374, 2025

  5. [5]

    Universal self-consistency for large language models

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language models. In International Conference on Machine Learning, Workshop on In-Context Learning, 2024 b

  6. [6]

    Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference

    Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference . The Annals of Statistics, 32 0 (6): 0 2385 -- 2411, 2004

  7. [7]

    An introduction to sequential Monte Carlo, volume 4

    Nicolas Chopin, Omiros Papaspiliopoulos, et al. An introduction to sequential Monte Carlo, volume 4. Springer, 2020

  8. [8]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  9. [9]

    Empirical analysis of beam search performance degradation in neural sequence models

    Eldan Cohen and Christopher Beck. Empirical analysis of beam search performance degradation in neural sequence models. In International Conference on Machine Learning, 2019

  10. [10]

    Feynman-Kac Formulae, pages 47--93

    Pierre Del Moral. Feynman-Kac Formulae, pages 47--93. Springer New York, 2004

  11. [11]

    Sequential monte carlo samplers

    Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68 0 (3): 0 411--436, 2006

  12. [12]

    An introduction to sequential monte carlo methods

    Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. Sequential Monte Carlo methods in practice, pages 3--14, 2001

  13. [13]

    Hierarchical Neural Story Generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation . In Iryna Gurevych and Yusuke Miyao, editors, Annual Meeting of the Association for Computational Linguistics , Long Papers , 2018

  14. [14]

    Deep Think with Confidence

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep Think with Confidence . arXiv preprint arXiv:2508.15260, 2025

  15. [15]

    J. Gai, G. Zeng, H. Zhang, and A. Raghunathan. Differential smoothing mitigates sharpening and improves LLM reasoning. arXiv preprint arXiv:2511.19942, 2025

  16. [16]

    W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57 0 (1): 0 97--109, 1970

  17. [17]

    He, Daniel Fried, and Sean Welleck

    Amy W. He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. In Conference on Empirical Methods in Natural Language Processing, 2025

  18. [18]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  19. [19]

    Manning, and Percy Liang

    John Hewitt, Christopher D. Manning, and Percy Liang. Truncation Sampling as Language Model Desmoothing . arXiv preprint arXiv:2210.15191, 2022

  20. [20]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  21. [21]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration . arXiv preprint arXiv:1904.09751, 2020

  22. [22]

    Scalable power sampling: Unlocking efficient, training-free reasoning for llms via distribution sharpening

    Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, and Haitham Bou Ammar. Scalable power sampling: Unlocking efficient, training-free reasoning for llms via distribution sharpening. arXiv preprint arXiv:2601.21590, 2026

  23. [23]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11 0 (14): 0 6421, 2021

  24. [24]

    Language models (mostly) know what they know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  25. [25]

    Scalable Best -of- N Selection for Large Language Models via Self - Certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable Best -of- N Selection for Large Language Models via Self - Certainty . arXiv preprint arXiv:2502.18581, 2025

  26. [26]

    Reasoning with sampling: Your base model is smarter than you think

    Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025

  27. [27]

    Local normalization distortion and the thermodynamic formalism of decoding strategies for large language models

    Tom Kempton and Stuart Burrell. Local normalization distortion and the thermodynamic formalism of decoding strategies for large language models. arXiv preprint arXiv:2503.21929, 2025

  28. [28]

    Six Challenges for Neural Machine Translation

    Philipp Koehn and Rebecca Knowles. Six Challenges for Neural Machine Translation. In Annual Meeting of the Association for Computational Linguistics, Workshop on Neural Machine Translation, 2017

  29. [29]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Symposium on Operating Systems Principles, 2023

  30. [30]

    Sixo: Smoothing inference with twisted objectives

    Dieterich Lawson, Allan Raventós, Andrew Warrington, and Scott Linderman. Sixo: Smoothing inference with twisted objectives. In Advances in Neural Information Processing Systems, 2022

  31. [31]

    Sample smart, not hard: Correctness-first decoding for better reasoning in LLMs

    Xueyan Li, Guinan Su, Mrinmaya Sachan, and Jonas Geiping. Sample smart, not hard: Correctness-first decoding for better reasoning in LLMs. In International Conference on Learning Representations, 2026a

  32. [32]

    -leaf enumeration: Non-repeating self-consistency via truncated tree search

    Xueyan Li, Johannes Zenn, Ekaterina Fadeeva, Guinan Su, Mrinmaya Sachan, and Jonas Geiping. -leaf enumeration: Non-repeating self-consistency via truncated tree search. International Conference on Learning Representations (ICLR), Workshop on Latent & Implicit Thinking, 2026b

  33. [33]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, 2024

  34. [34]

    Critic sequential monte carlo

    Vasileios Lioutas, Jonathan Wilder Lavington, Justice Sefas, Matthew Niedoba, Yunpeng Liu, Berend Zwartsenberg, Setareh Dabiri, Frank Wood, and Adam Scibior. Critic sequential monte carlo. In International Conference on Learning Representations, 2022

  35. [35]

    The harpy speech recognition system

    Bruce T Lowerre. The harpy speech recognition system. Carnegie Mellon University, 1976

  36. [36]

    Equation of state calculations by fast computing machines

    Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953

  37. [37]

    K. Ni, Z. Tan, Z. Liu, P. Li, and T. Chen. Can GRPO help LLMs transcend their pretraining origin? arXiv preprint arXiv:2510.15990, 2025

  38. [38]

    Karl Pearson. III. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. (A.), (185):71–110, 12 1894

  39. [39]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In Conference on Language Modeling, 2024

  40. [40]

    Y. Song, J. Kempe, and R. Munos. Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941, 2025

  41. [41]

    The proof and measurement of association between two things

    C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159

  42. [42]

    Calibration and correctness of language models for code

    Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. Calibration and correctness of language models for code. In International Conference on Software Engineering, 2025

  43. [43]

    A contrastive framework for neural text generation

    Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. A contrastive framework for neural text generation. In Advances in Neural Information Processing Systems, 2022

  44. [44]

    Sequence to sequence learning with neural networks

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 2014

  45. [45]

    Confidence improves self-consistency in LLMs

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in LLMs. In Findings of the Association for Computational Linguistics, 2025

  46. [46]

    Diverse beam search: Decoding diverse solutions from neural sequence models

    Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016

  47. [47]

    Arithmetic sampling: parallel diverse decoding for large language models

    Luke Vilnis, Yury Zemlyanskiy, Patrick Murray, Alexandre Tachard Passos, and Sumit Sanghai. Arithmetic sampling: parallel diverse decoding for large language models. In International Conference on Machine Learning, 2023

  48. [48]

    Soft Self-Consistency Improves Language Models Agents

    Han Wang, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Soft Self-Consistency Improves Language Models Agents. In Annual Meeting of the Association for Computational Linguistics, Short Papers, 2024a

  49. [49]

    Integrate the essence and eliminate the dross: Fine-grained self-consistency for free-form language generation

    Xinglin Wang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Boyuan Pan, Heda Wang, Yao Hu, and Kan Li. Integrate the essence and eliminate the dross: Fine-grained self-consistency for free-form language generation. In Annual Meeting of the Association for Computational Linguistics, Long Papers, 2024b

  50. [50]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171, 2023

  51. [51]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv preprint arXiv:2406.01574, 2024c

  52. [52]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  53. [53]

    Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation

    Yilin Yang, Liang Huang, and Mingbo Ma. Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2018

  54. [54]

    Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  55. [55]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, 2022

  56. [56]

    Be your own teacher: Improve the performance of convolutional neural networks via self distillation

    Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In International Conference on Computer Vision, 2019

  57. [57]

    Probabilistic inference in language models via twisted sequential monte carlo

    Stephen Zhao, Rob Brekelmans, Alireza Makhzani, and Roger Baker Grosse. Probabilistic inference in language models via twisted sequential monte carlo. In International Conference on Machine Learning, 2024

  58. [58]

    Instruction-following evaluation for large language models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023