When are likely answers right? On Sequence Probability and Correctness in LLMs

Johannes Zenn; Jonas Geiping

arxiv: 2606.27359 · v1 · pith:DEMLLILRnew · submitted 2026-06-25 · 📊 stat.ML · cs.LG

Pith reviewed 2026-06-26 01:53 UTC · model grok-4.3

keywords: sequence probability · LLM decoding · correctness · accuracy · self-consistency · verifier-free self-improvement · decoding methods

The pith

Sequence probability predicts correctness across prompt-answer pairs in a dataset but not when changing decoding methods or within responses to one prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper quantifies how well the conditional probability of a full answer given a prompt aligns with whether that answer is correct. It tests this alignment across four levels: different decoding methods, different hyperparameters inside one method, different prompt-answer pairs inside one dataset, and multiple answers generated for the identical prompt. A positive link appears when comparing answers across a fixed dataset, yet the link disappears when probability is raised by switching methods or settings, and it also fails to separate correct from incorrect answers to any single prompt. The results therefore limit how much correctness can be expected to rise simply by steering generation toward higher-probability sequences.
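The gap between the within-dataset level and the within-prompt level can be illustrated with a toy example. The sketch below uses hand-made numbers, not data from the paper, and a small pure-Python Spearman helper; the names `ranks`, `spearman`, and `toy` are hypothetical. An "easy" prompt yields high-probability, mostly correct answers and a "hard" prompt yields low-probability, mostly incorrect ones, so the pooled correlation is positive even though the per-prompt correlations average to zero.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical (log-probability, correct?) pairs for two prompts.
toy = {
    "easy": [(-2.0, 1), (-3.0, 0), (-4.0, 1), (-5.0, 1)],
    "hard": [(-20.0, 0), (-21.0, 1), (-22.0, 0), (-23.0, 0)],
}

# Within-prompt level: one coefficient per prompt.
within_rhos = {
    name: spearman([lp for lp, _ in rows], [c for _, c in rows])
    for name, rows in toy.items()
}

# Within-dataset level: pool all prompt-answer pairs.
pooled = [row for rows in toy.values() for row in rows]
pooled_rho = spearman([lp for lp, _ in pooled], [c for _, c in pooled])

print("within-prompt:", within_rhos)
print("pooled:", pooled_rho)
```

In this toy setup the pooled signal comes entirely from differences between prompts, which is one way a dataset-level correlation can coexist with no usable signal for ranking answers to a single prompt.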

Core claim

The paper establishes that sequence probability aligns with correctness when comparing different prompt-answer pairs inside one dataset, yet this alignment does not extend to choices made by varying decoding hyperparameters or methods, nor does it hold for selecting among responses to a single prompt.

What carries the argument

Sequence probability (the model's conditional probability of an entire continuation given the prompt) and its measured correlation with correctness at four distinct levels of comparison.
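As a minimal sketch of the quantity involved, assuming the usual autoregressive factorization, the sequence log-probability is the sum of per-token log-probabilities, and a length-normalized variant averages them (the figure captions mention both a summed and a per-token version). The function names and token probabilities below are hypothetical and chosen only to show that the two variants can rank the same pair of answers differently.

```python
import math

def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities: log p(continuation | prompt)."""
    return sum(token_logprobs)

def per_token_logprob(token_logprobs):
    """Length-normalized variant: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical token probabilities for two continuations of one prompt.
short_answer = [math.log(p) for p in (0.5, 0.5)]
long_answer = [math.log(p) for p in (0.8,) * 8]

for name, lps in (("short", short_answer), ("long", long_answer)):
    print(name, round(sequence_logprob(lps), 3), round(per_token_logprob(lps), 3))
```

Here the short answer has the higher summed log-probability while the long answer has the higher per-token log-probability, so any probability-versus-correctness analysis has to state which variant it uses.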

If this is right

Decoding methods that increase sequence probability cannot be assumed to raise accuracy.
Hyperparameter changes that raise sequence probability do not reliably improve performance.
Sequence probability cannot be used to pick the correct answer among multiple generations for the same prompt.
Self-consistency and verifier-free self-improvement techniques that rely on probability need to account for these limits.
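To make the last two points concrete, here is a hedged sketch contrasting uniform majority voting with probability-weighted voting over several sampled answers to one prompt. Weighting each vote by `exp(logprob)` is an assumption made for illustration and may differ from the weighting used in the paper; the sample values and function names are hypothetical.

```python
import math
from collections import defaultdict

def majority_vote(samples):
    """samples: list of (final_answer, sequence_logprob); every vote counts 1."""
    counts = defaultdict(float)
    for answer, _ in samples:
        counts[answer] += 1.0
    return max(counts, key=counts.get)

def probability_weighted_vote(samples):
    """Each vote is weighted by the sequence probability exp(logprob)."""
    weights = defaultdict(float)
    for answer, logprob in samples:
        weights[answer] += math.exp(logprob)
    return max(weights, key=weights.get)

# Hypothetical samples: three long derivations agree on "42",
# one short, high-probability response answers "17".
samples = [("42", -12.0), ("42", -11.5), ("42", -13.0), ("17", -3.0)]

print(majority_vote(samples))              # 42
print(probability_weighted_vote(samples))  # 17
```

A single high-probability sample can outvote a consistent majority, which is the failure mode to expect if sequence probability does not separate correct from incorrect answers to the same prompt.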

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training objectives that directly encourage probability to track correctness may be needed rather than relying on post-training decoding adjustments.
Signals other than sequence probability, such as cross-sample consistency, could be combined with or replace probability for answer selection.
Dataset-specific thresholds on probability might still allow rough correctness prediction even if they do not guide decoding.

Load-bearing premise

The models, benchmarks, and decoding methods tested are representative enough that the observed probability-correctness patterns apply beyond the specific cases examined.

What would settle it

A controlled experiment in which raising sequence probability, through a new decoding method or hyperparameter setting, consistently raises accuracy across multiple models and benchmarks would falsify the claim that such increases do not reliably improve correctness.

Figures

Figures reproduced from arXiv: 2606.27359 by Johannes Zenn, Jonas Geiping.

**Figure 1.** The correlation between log-probability and correctness across three levels. Left: Across-method correlation: log-probability plotted against accuracy for various decoding methods and hyperparameters (connected by lines). Within-method correlation: plotted without transparency. Middle: Within-dataset correlation: for one method (BoN with N = 32) we bin correct (green dots, top) and incorrect data (red dots…

**Figure 2.** Within-dataset correlation: log-probability and correctness are correlated for each dataset. Each panel scatters samples using log-probability and correctness. ρ: Spearman correlation coefficient. r: (binned) Pearson correlation coefficient. Correlation is strongest for MATH500 and smaller but positive for GPQA, HumanEval, MedQA, and MMLU. Only IFEval shows a negative correlation. Plots show Qwen3-8B-Base …

**Figure 3.** Within-dataset correlation: log-probability and correctness are consistently correlated across model families and datasets, largely independent of methods. Base models show more negative and diverse correlation and are consistently negative for IFEval. Posttrained models show consistently positive correlation. Correlation coefficients r are averaged over model sizes at a canonical hyperparameter for each me…

**Figure 4.** Within-method correlation: while sequences are achieving higher log-probability, one cannot predict correctness by changing a hyperparameter of the method. Correlations within methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets …

**Figure 5.** Across-method correlation: many methods produce samples that are both more probable and more accurate than the low-temperature sampling (LTS) baseline. However, correlations are not consistent across models and datasets. Correlations across methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and ben…

**Figure 6.** Within-dataset correlation: the correlation between log-probability and correctness increases with accuracy. Probability-based self-improvement loops would require the model to have sufficient accuracy. Accuracy is plotted against correlation across all datasets, methods, and models (left: base, right: posttrained, at canonical hyperparameter). We observe positive correlation (base: r = 0.66, posttrained: …

**Figure 7.** Color coding for …

**Figure 8.** Within-sample correlation: correlation coefficients are distributed symmetrically around zero (left, middle). Further (right), the more correct a set of continuations for a single prompt is, the more positive is its correlation. Per-sample rank correlation coefficient of Qwen3 base models (left) and posttrained models (middle) is distributed symmetrically with mean zero. Right: Correlation coefficient and …

**Figure 9.** Color coding for …

**Figure 10.** Power self-consistency improves over self-consistency on MATH500. On GPQA and MMLU, power self-consistency degrades performance. Plot shows data for the Qwen3 model series across various numbers of samples k. See Section 3.4 for a discussion.

**Figure 11.** Self-consistency with probability-weighted voting often underperforms majority voting. Power self-consistency does not show consistent differences. Accuracy difference between (power) self-consistency using uniform or probability-weighted majority voting. Results on Qwen3 model series across three datasets. See Section 3.4 for details.

**Figure 12.** We observe consistent within-dataset correlation across model families and datasets largely independent of methods. Base models show more negative and diverse correlation and are consistently negative for IFEval. Posttrained models show consistently positive correlation. Correlation coefficients ρ averaged over model sizes at representative hyperparameter for each method. Methods are in the legend. Datase…

**Figure 14.** Methods show either positive or negative correlation within their hyperparameter. Correlations within methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. Detailed discussion in Section 3.2.

**Figure 16.** Various datasets and methods show positive correlations between sequence probability and correctness. Correlations across methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. See Section 3.3.

**Figure 17.** Within-sample correlation coefficients are generally small with MATH500 consistently showing much larger values. Datasets are plotted horizontally on the bottom: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU.

**Figure 18.** The more correct a sample is under repeated sampling, the stronger its correlation coefficient. We observe similar results across all model families (columns) and variants (rows).

**Figure 19.** Correlation coefficients are mostly symmetrically distributed around zero; the more correct a sample is, the more positive its correlation. Per-sample rank correlation coefficient of Qwen3 base models (left) and posttrained models (middle) is distributed symmetrically with mean zero. Correlation is computed within a sample across methods. Right: Correlation coefficient and fraction of correct samples seems…

**Figure 20.** Correlation coefficients are mostly symmetrically distributed around zero; the more correct a sample is, the more positive its correlation. Per-sample rank correlation coefficient of Qwen3 base models (left) and posttrained models (middle) is distributed symmetrically with mean zero. Correlation is computed within a sample across methods. Right: Correlation coefficient and fraction of correct samples seems…

**Figure 21.** Within-sample correlation coefficients are generally small with MATH500 consistently showing much larger values. Datasets are plotted horizontally on the bottom: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU.

**Figure 23.** The more correct a sample is under repeated sampling, the stronger its correlation coefficient. We observe similar results across all model families (columns) and variants (rows).

**Figure 24.** Beam search often achieves smallest log-probabilities; BoN, SPS, and power-SMC have larger log-probabilities. Log-probabilities averaged over model sizes. Typically, BoN, SPS, and power-SMC lead to smallest log-probabilities. LTS is typically larger and beam search is largest. Olmo3 seems to be an outlier for power-SMC where log-probabilities appear very small. Datasets are plotted horizontally on the bot…

**Figure 25.** Global decoding methods produce sequences of larger per-token log-probability that are smoother in posttrained models. For base models, LTS samples are shortest (top row), SPS and BoN samples appear longer. There is no meaningful difference between log-probability and smoothness. For posttrained models, we observe larger per-token log-probability and smoother trajectories for global decoding methods. Qwen…

**Figure 26.** Global decoding methods produce sequences of larger per-token log-probability that are smoother in posttrained models. For base models, LTS samples are shortest (top row), SPS and BoN samples appear longer. There is no meaningful difference between log-probability and smoothness. For posttrained models, we observe larger per-token log-probability and smoother trajectories for global decoding methods. Qwen…

**Figure 27.** We find that using per-token log-probability does not change the within-dataset correlation significantly when considering Spearman ρ. For Pearson r, many points get pushed to the corner; the off-diagonal elements seem to be consistent for ρ and r. Axis labels: Spearman ρ (sum), Spearman ρ (average), Pearson r (sum), …; legend: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU.

**Figure 28.** Methods show either positive or negative correlation within their hyperparameter using per-token log-probability. Correlations within methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models. Within-method correlations. C…

**Figure 29.** Methods show either positive or negative correlation within their hyperparameter using per-token log-probability. Correlations within methods for local and global decoding methods, models (Qwen2.5 series 8B as ∗1, Math 8B as ∗2 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.

**Figure 30.** Methods show either positive or negative correlation within their hyperparameter using per-token log-probability. Correlations within methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.

**Figure 31.** Various datasets and methods show positive correlations between sequence probability and correctness using per-token log-probability. Correlations across methods for local and global decoding methods, models (Qwen3 series 0.6B as ∗1, 1.7B as ∗2, 4B as ∗3, and 8B as ∗4 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.

**Figure 32.** Various datasets and methods show positive correlations between sequence probability and correctness using per-token log-probability. Correlations across methods for local and global decoding methods, models (Qwen2.5 series 8B as ∗1, Math 8B as ∗2 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.

**Figure 33.** Various datasets and methods show positive correlations between sequence probability and correctness using per-token log-probability. Correlations across methods for local and global decoding methods, models (Olmo3 series 7B as ∗1 with base as B∗ and posttrained model as P∗), and benchmark datasets. Correlations are not consistent across datasets and models.

**Figure 35.** Sequence lengths overlap across methods; beam search is the only exception. Data for Qwen3-8B. Details in Appendix E.4.

**Figure 36.** Left: Comparing mode and power-distribution we find no consistent differences in accuracy. Middle, right: BoN typically leads to shorter samples that are of smaller probability. Datasets are plotted horizontally on the bottom of each panel: GPQA, HumanEval, IFEval, MATH500, MedQA, MMLU. Qwen3 model family.

**Figure 37.** Self-distillation shows diverse results across methods. On MATH500, power self-distillation seems to often improve results. Otherwise, results are mixed. Qwen3 base models.

**Figure 38.** Left: Qwen3 thinking mostly improves on HumanEval, MMLU, and MedQA; the strongest decline can be observed on IFEval. Across methods, beam search often performs worst. Right: Thinking models often reach the token limit. The mean sequence length for thinking models is much higher than for non-thinking models.

Original abstract

Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sequence probability tracks correctness across different prompts in a dataset but fails to guide decoding changes or repeated answers to one prompt, with the main open question being how far the tested models and tasks reach.

The main takeaway is that higher sequence probability often lines up with correct answers when you compare across different prompt-answer pairs in one dataset. That link breaks when you change decoding methods or hyperparameters to raise probability, and it also fails to flag correctness among multiple responses to the same prompt.

The paper organizes this at four explicit levels: across methods, within-method hyperparameters, across prompt-answer pairs, and across repeats for a fixed prompt. That structure turns some existing observations into clearer practical guidance on when likelihood-based decoding can be expected to help and when it cannot. The multi-level empirical check is the part that adds value.

The experiments cover multiple models, benchmarks, and decoding approaches, which supports the claims as far as they go. The negative results on transfer and within-prompt use are the sharper findings.

The soft spot is representativeness. The statements use 'often' and 'generally,' so they depend on whether the chosen models, sizes, and tasks are diverse enough that the patterns are not tied to the specific slice tested. If the coverage is narrow, the transfer failure could look different on larger models or open-ended tasks.

This is for people working on LLM inference, decoding, and self-consistency. A reader in that area gets usable limits from the breakdown. It deserves peer review because the question is practical and the four-level approach is a direct way to address it.

Referee Report

2 major / 1 minor

Summary. The manuscript examines the alignment between sequence probability (conditional probability of a continuation given a prompt) and correctness in LLMs. It quantifies the relationship at four levels—across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt—finding that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset, but that increasing sequence probability via changes to hyperparameters or methods does not reliably improve accuracy, and that sequence probability is not a reliable indicator of correctness for responses to the same prompt. The work aims to clarify when decoding methods can and cannot be expected to improve correctness.

Significance. If the reported patterns hold under broader testing, the multi-level empirical analysis provides useful practical guidance for decoding choices, self-consistency, and verifier-free self-improvement in LLMs by delineating where probability-correctness alignment exists and where it fails to transfer. The four-level breakdown is a strength, as it separates within-dataset correlations from causal effects of decoding interventions.

major comments (2)

[Abstract] Abstract: The headline claims use qualifiers such as 'often predictive' and 'does not generally transfer.' These rest on the assumption that the tested models, benchmarks, and decoding methods are representative; however, the abstract provides no enumeration of the specific models (scale, families), benchmarks (task types), or methods, making it impossible to assess whether the negative transfer result is an artifact of a narrow experimental slice (e.g., only small models or multiple-choice tasks).
[Abstract and experimental sections] The central empirical claims at the 'across decoding methods' and 'across hyperparameters' levels require evidence that the observed lack of accuracy improvement when sequence probability is increased is not driven by post-hoc benchmark or method selection. Without explicit reporting of the full set of methods, models, and controls (including statistical tests and exclusion criteria), the 'does not generally transfer' conclusion cannot be evaluated for robustness.

minor comments (1)

[Introduction] Clarify the precise definition of 'sequence probability' (e.g., whether it is the product of token probabilities or normalized) with an equation early in the paper to aid reproducibility.
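One form such a definition could take, assuming the standard autoregressive factorization and using illustrative notation (prompt $x$, continuation $y = (y_1, \dots, y_T)$, model $p_\theta$) that may not match the paper's, is given below together with a length-normalized per-token variant of the kind the figure captions refer to:

```latex
\log p_\theta(y \mid x) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right),
\qquad
\overline{\ell}_\theta(y \mid x) \;=\; \frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right).
```

The first expression is the log of the product of token probabilities; the second divides it by the continuation length $T$.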

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our experimental scope. We address each major comment below.

Referee: [Abstract] Abstract: The headline claims use qualifiers such as 'often predictive' and 'does not generally transfer.' These rest on the assumption that the tested models, benchmarks, and decoding methods are representative; however, the abstract provides no enumeration of the specific models (scale, families), benchmarks (task types), or methods, making it impossible to assess whether the negative transfer result is an artifact of a narrow experimental slice (e.g., only small models or multiple-choice tasks).

Authors: We agree that the abstract should briefly enumerate the experimental scope to support the qualifiers. In revision we will add a concise clause listing the model families and scales, benchmark task types (including both multiple-choice and generative), and decoding methods tested, while preserving length constraints. revision: yes
Referee: [Abstract and experimental sections] The central empirical claims at the 'across decoding methods' and 'across hyperparameters' levels require evidence that the observed lack of accuracy improvement when sequence probability is increased is not driven by post-hoc benchmark or method selection. Without explicit reporting of the full set of methods, models, and controls (including statistical tests and exclusion criteria), the 'does not generally transfer' conclusion cannot be evaluated for robustness.

Authors: The body already reports results aggregated over the full set of models, benchmarks, and methods used. To address the concern directly we will add an appendix that lists every configuration considered, states inclusion/exclusion criteria, and reports the statistical tests applied to accuracy differences. This will make the robustness of the 'does not generally transfer' finding transparent without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observational study

The paper conducts an empirical analysis of sequence probability vs. correctness across decoding methods, models, benchmarks, and prompt levels. It reports observational patterns from experiments without any derivation chain, equations, fitted parameters presented as predictions, or self-citations that reduce claims to inputs by construction. No self-definitional, fitted-input, or uniqueness-theorem patterns appear. The central claims rest on direct measurement rather than reduction to prior fitted quantities or author-specific ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the work rests on standard empirical correlation and comparison methods in machine learning.

axioms (1)

[standard math] Standard statistical assumptions underlying correlation and predictive-power measurements hold for the chosen benchmarks and models.
The reported predictive relationships presuppose that conventional statistical tools apply without hidden biases in the LLM output distributions.

Reference graph

Works this paper leans on

58 extracted references · 14 linked inside Pith

[1]

Power-smc: Low-latency sequence-level power sampling for training-free llm reasoning

Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, and Massoud Pedram. Power-smc: Low-latency sequence-level power sampling for training-free llm reasoning. arXiv preprint arXiv:2602.10273, 2026

arXiv 2026
[2]

Smoothing algorithms for state--space models

Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state--space models. Annals of the Institute of Statistical Mathematics, 62: 0 61--89, 2010

2010
[3]

Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024 a

arXiv 2024
[4]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code, 2021. arXiv preprint arXiv:2107.03374, 2025

Pith/arXiv arXiv 2021
[5]

Universal self-consistency for large language models

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language models. In International Conference on Machine Learning, Workshop on In-Context Learning, 2024 b

2024
[6]

Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference

Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference . The Annals of Statistics, 32 0 (6): 0 2385 -- 2411, 2004

2004
[7]

An introduction to sequential Monte Carlo, volume 4

Nicolas Chopin, Omiros Papaspiliopoulos, et al. An introduction to sequential Monte Carlo, volume 4. Springer, 2020

2020
[8]

Training verifiers to solve math word problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[9]

Empirical analysis of beam search performance degradation in neural sequence models

Eldan Cohen and Christopher Beck. Empirical analysis of beam search performance degradation in neural sequence models. In International Conference on Machine Learning, 2019

2019
[10]

Feynman-Kac Formulae, pages 47--93

Pierre Del Moral. Feynman-Kac Formulae, pages 47--93. Springer New York, 2004

2004
[11]

Sequential monte carlo samplers

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68 0 (3): 0 411--436, 2006

2006
[12]

An introduction to sequential monte carlo methods

Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. Sequential Monte Carlo methods in practice, pages 3--14, 2001

2001
[13]

Hierarchical Neural Story Generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation . In Iryna Gurevych and Yusuke Miyao, editors, Annual Meeting of the Association for Computational Linguistics , Long Papers , 2018

2018
[14]

Deep Think with Confidence

Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep Think with Confidence . arXiv preprint arXiv:2508.15260, 2025

Pith/arXiv arXiv 2025
[15]

J. Gai, G. Zeng, H. Zhang, and A. Raghunathan. Differential smoothing mitigates sharpening and improves LLM reasoning. arXiv preprint arXiv:2511.19942, 2025

arXiv 2025
[16]

W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57 0 (1): 0 97--109, 1970

1970
[17]

He, Daniel Fried, and Sean Welleck

Amy W. He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. In Conference on Empirical Methods in Natural Language Processing, 2025

2025
[18]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

Pith/arXiv arXiv 2021
[19]

Manning, and Percy Liang

John Hewitt, Christopher D. Manning, and Percy Liang. Truncation Sampling as Language Model Desmoothing . arXiv preprint arXiv:2210.15191, 2022

arXiv 2022
[20]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[21]

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration. arXiv preprint arXiv:1904.09751, 2020

[22]

Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, and Haitham Bou Ammar. Scalable power sampling: Unlocking efficient, training-free reasoning for LLMs via distribution sharpening. arXiv preprint arXiv:2601.21590, 2026

[23]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

[24]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

[25]

Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable Best-of-N Selection for Large Language Models via Self-Certainty. arXiv preprint arXiv:2502.18581, 2025

[26]

Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025

[27]

Tom Kempton and Stuart Burrell. Local normalization distortion and the thermodynamic formalism of decoding strategies for large language models. arXiv preprint arXiv:2503.21929, 2025

[28]

Philipp Koehn and Rebecca Knowles. Six Challenges for Neural Machine Translation. In Annual Meeting of the Association for Computational Linguistics, Workshop on Neural Machine Translation, 2017

[29]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Symposium on Operating Systems Principles, 2023

[30]

Dieterich Lawson, Allan Raventós, Andrew Warrington, and Scott Linderman. SIXO: Smoothing inference with twisted objectives. In Advances in Neural Information Processing Systems, 2022

[31]

Xueyan Li, Guinan Su, Mrinmaya Sachan, and Jonas Geiping. Sample smart, not hard: Correctness-first decoding for better reasoning in LLMs. In International Conference on Learning Representations, 2026a

[32]

Xueyan Li, Johannes Zenn, Ekaterina Fadeeva, Guinan Su, Mrinmaya Sachan, and Jonas Geiping. -leaf enumeration: Non-repeating self-consistency via truncated tree search. International Conference on Learning Representations (ICLR), Workshop on Latent & Implicit Thinking, 2026b

[33]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, 2024

[34]

Vasileios Lioutas, Jonathan Wilder Lavington, Justice Sefas, Matthew Niedoba, Yunpeng Liu, Berend Zwartsenberg, Setareh Dabiri, Frank Wood, and Adam Scibior. Critic sequential Monte Carlo. In International Conference on Learning Representations, 2022

[35]

Bruce T. Lowerre. The Harpy speech recognition system. Carnegie Mellon University, 1976

[36]

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953

[37]

K. Ni, Z. Tan, Z. Liu, P. Li, and T. Chen. Can GRPO help LLMs transcend their pretraining origin? arXiv preprint arXiv:2510.15990, 2025

[38]

Karl Pearson. III. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. (A.), (185):71–110, 12 1894
[39]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In Conference on Language Modeling, 2024

[40]

Y. Song, J. Kempe, and R. Munos. Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941, 2025

[41]

C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159

[42]

Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. Calibration and correctness of language models for code. In International Conference on Software Engineering, 2025

[43]

Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. A contrastive framework for neural text generation. In Advances in Neural Information Processing Systems, 2022

[44]

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014

[45]

Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in LLMs. In Findings of the Association for Computational Linguistics, 2025

[46]

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016

[47]

Luke Vilnis, Yury Zemlyanskiy, Patrick Murray, Alexandre Tachard Passos, and Sumit Sanghai. Arithmetic sampling: parallel diverse decoding for large language models. In International Conference on Machine Learning, 2023

[48]

Han Wang, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Soft Self-Consistency Improves Language Models Agents. In Annual Meeting of the Association for Computational Linguistics, Short Papers, 2024a

[49]

Xinglin Wang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Boyuan Pan, Heda Wang, Yao Hu, and Kan Li. Integrate the essence and eliminate the dross: Fine-grained self-consistency for free-form language generation. In Annual Meeting of the Association for Computational Linguistics, Long Papers, 2024b

[50]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171, 2023

[51]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv preprint arXiv:2406.01574, 2024c

[52]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report. arXiv, 2025

[53]

Yilin Yang, Liang Huang, and Mingbo Ma. Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2018

[54]

Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025

[55]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, 2022

[56]

Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In International Conference on Computer Vision, 2019

[57]

Stephen Zhao, Rob Brekelmans, Alireza Makhzani, and Roger Baker Grosse. Probabilistic inference in language models via twisted sequential Monte Carlo. In International Conference on Machine Learning, 2024

[58]

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

