Reasoning with Sampling: Your Base Model is Smarter Than You Think
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-17 03:11 UTC · model grok-4.3
The pith
A simple iterative sampling algorithm using only a base model's likelihoods can elicit reasoning performance that nearly matches or exceeds reinforcement learning on tasks like math and coding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that an MCMC-style iterative sampler operating on a base model's likelihoods generates substantially better reasoning trajectories on single-shot tasks than standard sampling, achieving results that nearly match and sometimes exceed those obtained from reinforcement learning post-training while preserving greater output diversity.
What carries the argument
An iterative sampling procedure inspired by Markov chain Monte Carlo that repeatedly uses the base model's own likelihoods to reshape and refine reasoning outputs without any external guidance.
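As a concrete illustration of what such a procedure can look like, here is a toy Metropolis-Hastings sketch that targets a sharpened distribution p^α using only the model's own likelihoods. Everything here is assumed for illustration, not taken from the paper: the "base model" is reduced to independent per-token log-probabilities, the proposal resamples a single position, and alpha is a sharpening exponent. The paper's actual algorithm operates on LLM continuations, but the key property the review highlights, acceptance decided purely by likelihood ratios, is the same.

```python
import math
import random

def sequence_logprob(seq, logprobs):
    """Sum of per-token log-likelihoods under a toy 'base model'."""
    return sum(logprobs[t] for t in seq)

def mcmc_sharpened_sample(logprobs, length, alpha=4.0, steps=200, seed=0):
    """Metropolis-Hastings targeting p(seq)^alpha.

    Proposals resample one position from the base distribution; the
    acceptance step uses only the model's own likelihood ratios, with
    no external verifier or reward model involved.
    """
    rng = random.Random(seed)
    tokens = list(logprobs)
    weights = [math.exp(logprobs[t]) for t in tokens]
    seq = rng.choices(tokens, weights=weights, k=length)
    cur_lp = sequence_logprob(seq, logprobs)
    for _ in range(steps):
        pos = rng.randrange(length)
        prop = seq[:]
        prop[pos] = rng.choices(tokens, weights=weights, k=1)[0]
        prop_lp = sequence_logprob(prop, logprobs)
        # The proposal is an independent base-model draw at `pos`, so the
        # q-ratio cancels one factor of p; the net exponent is alpha - 1.
        log_accept = (alpha - 1.0) * (prop_lp - cur_lp)
        if math.log(rng.random() + 1e-12) < log_accept:
            seq, cur_lp = prop, prop_lp
    return seq, cur_lp
```

With alpha > 1 the chain concentrates on high-likelihood sequences while still sampling, rather than greedily decoding, which is the mechanism behind the claimed diversity preservation.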
If this is right
- Reasoning gains on math, coding, and science questions become available at inference time without any model updates.
- Generated solutions maintain higher diversity across repeated draws than is typical after reinforcement learning.
- The same procedure works on multiple base models and across tasks that do not require hand-curated data.
- No external verifier or reward model is needed to obtain the reported performance lifts.
Where Pith is reading between the lines
- If the pattern holds, reinforcement learning may largely be surfacing and amplifying capabilities already latent in base models rather than installing entirely new ones.
- The approach could be tested on open-ended or non-verifiable tasks where traditional RL is harder to apply.
- Combining the sampler with other inference-time techniques such as self-consistency or tree search might produce further additive gains.
- Model developers could focus more on base pretraining quality if sampling methods prove sufficient for many reasoning needs.
Load-bearing premise
The base model's likelihoods contain enough useful signal to let a simple iterative sampler turn them into higher-quality reasoning paths without training or an external verifier.
What would settle it
Running the iterative sampler on a base model for the MATH500 benchmark and finding no gain or a clear drop in accuracy compared with ordinary sampling would falsify the central claim.
Original abstract
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an MCMC-inspired iterative sampling procedure that reshapes reasoning trajectories at inference time by using only the base LLM's token-level likelihoods for acceptance decisions. The central claim is that this training-free method produces substantial accuracy gains on single-shot reasoning benchmarks (MATH500, HumanEval, GPQA) that nearly match or exceed those obtained from RL post-training, while preserving sample diversity and requiring neither curated data nor an external verifier.
Significance. If the empirical results and the underlying assumption about likelihood signal hold, the work would be significant: it would demonstrate that advanced reasoning behaviors are already latent in base models and can be elicited by inference-time sampling rather than expensive RL, while avoiding the diversity collapse typical of RL-tuned models. The absence of a verifier requirement would also broaden applicability to domains without easy verification.
major comments (2)
- §3 (Algorithm description): The acceptance step is defined solely via likelihood ratios drawn from the base model. On MATH500, HumanEval, and GPQA, however, correct solutions are characteristically longer and contain lower-probability token sequences than fluent but incorrect answers; an acceptance rule based only on the model's own likelihoods therefore risks preferentially retaining incorrect trajectories. The manuscript must supply either a direct correlation analysis between likelihood and correctness or an ablation that isolates the effect of the acceptance criterion to establish that the likelihood signal is aligned with accuracy rather than anti-aligned.
- §4 and §5 (Experimental results): The abstract states that the sampler 'nearly match[es] and even outperform[s]' RL on MATH500, HumanEval, and GPQA, yet the reported tables and figures lack error bars, statistical significance tests, and controls for prompt length, temperature, and number of MCMC steps. Without these, it is impossible to determine whether the claimed gains are robust or whether they could be explained by differences in effective compute or sampling budget.
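The correlation analysis requested in the first comment could be as simple as a point-biserial correlation between sequence log-likelihood and binary correctness; a positive value would indicate the likelihood signal is aligned with accuracy. A minimal stdlib-only sketch (the function name and data layout are illustrative, not from the paper):

```python
import math

def point_biserial(loglik, correct):
    """Point-biserial correlation between a continuous score (sequence
    log-likelihood) and a binary outcome (answer correctness).
    Positive values mean higher likelihood goes with correct answers."""
    n = len(loglik)
    ones = [x for x, c in zip(loglik, correct) if c]
    zeros = [x for x, c in zip(loglik, correct) if not c]
    if not ones or not zeros:
        raise ValueError("need both correct and incorrect samples")
    mean = sum(loglik) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in loglik) / n)
    if std == 0:
        raise ValueError("log-likelihoods are constant")
    p = len(ones) / n
    m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    return (m1 - m0) / std * math.sqrt(p * (1 - p))
```

A clearly negative value on the benchmarks would support the referee's anti-alignment worry; a clearly positive one would support the authors.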
minor comments (2)
- The pseudocode for the iterative sampler is referenced but not shown in a single, self-contained listing; adding an explicit algorithm box would improve reproducibility.
- Figure captions should explicitly state the number of samples drawn per method and the exact temperature schedule used for the base-model baseline.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
- Referee: §3 (Algorithm description): The acceptance step is defined solely via likelihood ratios drawn from the base model. On MATH500, HumanEval, and GPQA, however, correct solutions are characteristically longer and contain lower-probability token sequences than fluent but incorrect answers; an acceptance rule based only on the model's own likelihoods therefore risks preferentially retaining incorrect trajectories. The manuscript must supply either a direct correlation analysis between likelihood and correctness or an ablation that isolates the effect of the acceptance criterion to establish that the likelihood signal is aligned with accuracy rather than anti-aligned.
  Authors: We appreciate this insightful observation regarding the potential misalignment between likelihood and correctness. While it is true that correct solutions can be longer and thus lower probability under the base model, our iterative sampling procedure is designed to explore multiple trajectories and accept based on relative likelihoods, which empirically leads to better reasoning performance. To directly address the referee's request, we will include in the revised version a correlation analysis plotting the model's likelihood against solution correctness on the benchmarks, as well as an ablation study comparing the full sampler to a version without the acceptance criterion (i.e., pure sampling). This will demonstrate that the acceptance step contributes positively to accuracy. revision: yes
- Referee: §4 and §5 (Experimental results): The abstract states that the sampler 'nearly match[es] and even outperform[s]' RL on MATH500, HumanEval, and GPQA, yet the reported tables and figures lack error bars, statistical significance tests, and controls for prompt length, temperature, and number of MCMC steps. Without these, it is impossible to determine whether the claimed gains are robust or whether they could be explained by differences in effective compute or sampling budget.
  Authors: We agree that the experimental section would benefit from greater statistical detail to substantiate the robustness of our results. In the revised manuscript, we will add error bars to all reported metrics, computed over multiple independent runs with different random seeds. We will also include statistical significance tests (e.g., paired t-tests) comparing our sampler to the baselines. Additionally, we will provide controls by varying prompt lengths, temperatures, and the number of MCMC steps, and report performance as a function of these parameters to rule out explanations based solely on increased sampling budget or compute. revision: yes
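The paired t-tests the authors promise are straightforward when each method is run with matched seeds. A minimal stdlib-only sketch (the accuracy values below are illustrative, not from the paper):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic over matched runs (e.g. per-seed accuracies of
    the sampler vs. a baseline). Returns (t, dof); compare |t| against a
    t table, e.g. |t| > 2.776 for p < 0.05 two-sided at dof = 4."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

Pairing by seed removes between-run variance that an unpaired comparison would absorb, which matters when per-benchmark gains are only a few points.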
Circularity Check
No significant circularity; method and claims are independently specified
Full rationale
The paper defines a straightforward iterative MCMC-style sampler that directly consumes the base model's token likelihoods as the sole source of signal, without any parameter fitting to the evaluation tasks, without renaming an empirical pattern as a derivation, and without load-bearing self-citations or uniqueness theorems. The algorithm is presented as a direct application of existing sampling ideas to LLM logits; the reported gains on MATH500, HumanEval, and GPQA are obtained by running the sampler on external benchmarks rather than by construction from the same quantities used to define the procedure. No equation or step reduces the claimed improvement to a tautological restatement of the input likelihoods or to prior work by the same authors.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Base LLMs contain latent reasoning capabilities that can be surfaced by reshaping their output distribution through repeated sampling from their own likelihoods.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel) · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  "Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods... Metropolis-Hastings... A(x,x_i) = min{1, p^α(x) q(x_i|x) / p^α(x_i) q(x|x_i)}"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean (J_uniquely_calibrated_via_higher_derivative) · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  "sampling from p^α encourages sampling tokens which have fewer but higher likelihood future paths"
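Transcribed into display form, the Metropolis-Hastings acceptance rule quoted in the first passage (with α the sharpening exponent and q the proposal distribution) is:

```latex
A(x, x_i) = \min\left\{ 1,\; \frac{p^{\alpha}(x)\, q(x_i \mid x)}{p^{\alpha}(x_i)\, q(x \mid x_i)} \right\}
```

For α > 1 this targets the sharpened distribution p^α, which is what the second quoted passage describes: the sampler favors tokens with fewer but higher-likelihood future paths.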
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Evaluating Large Language Models in Scientific Discovery
  The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
- Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
  RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
- Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
  UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
  Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
- LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
  KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
- Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
  RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer, an efficient contrastive method that recovers most RL gains at three orders of magni...
- Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
  A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
- Hypothesis generation and updating in large language models
  LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
- Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
  Entrocraft uses rejection sampling to enforce custom entropy curves in LLM RL, sustaining longer training, better generalization, and higher output diversity than prior regularization approaches.
- Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
  Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
- The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
  TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
- How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
  IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
  OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
- Beyond Distribution Sharpening: The Importance of Task Rewards
  Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- The Role of Generator Access in Autoregressive Post-Training
  Limited generator access in autoregressive post-training confines learners to root-start rollouts whose value is bounded by on-policy prefix probabilities, while weak prefix control unlocks richer observations and pro...
- Position: agentic AI orchestration should be Bayes-consistent
  Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.