Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Andrew Zhao; Gao Huang; Rui Lu; Shiji Song; Yang Yue; Zhaokai Wang; Zhiqi Chen

arXiv: 2504.13837 · v5 · submitted 2025-04-18 · cs.AI · cs.CL · cs.CV


Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, Gao Huang

Pith reviewed 2026-05-10 23:11 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.CV

keywords: reinforcement learning with verifiable rewards, large language models, reasoning, pass@k, base model, distillation, mathematics benchmarks, coding benchmarks


The pith

Reinforcement learning with verifiable rewards improves small-k performance but does not create new reasoning patterns beyond the base model's capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether RLVR training allows large language models to develop genuinely novel reasoning abilities not present in their base versions. Across math, coding, and visual benchmarks, RLVR models beat their base models on single-attempt success, but the base models solve more problems in total when many attempts are permitted. Coverage and perplexity measurements show that every reasoning strategy observed after RLVR already exists in the base model's output distribution. Multiple popular RLVR algorithms reach similar levels of performance, all well below the base model's full potential. Distillation, by contrast, succeeds in adding new patterns from a stronger teacher.

Core claim

RLVR-trained models do not elicit fundamentally new reasoning patterns. While they outperform base models at small k, the base models achieve higher pass@k scores when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. Distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities.
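The crossover between small k and large k can be illustrated with a toy calculation. The sketch below assumes each problem has a fixed per-sample success probability and that samples are independent; the probabilities are hypothetical and are not taken from the paper. The illustrative RLVR model is sharper on problems it already solves but puts no mass on two hard problems that the base model occasionally solves.

```python
# Hypothetical per-problem single-sample success probabilities (NOT from the paper).
base_p = [0.30, 0.20, 0.05, 0.02, 0.01]
rlvr_p = [0.90, 0.80, 0.40, 0.00, 0.00]


def expected_pass_at_k(probs, k):
    """Mean over problems of P(at least one of k independent samples is correct)."""
    return sum(1.0 - (1.0 - p) ** k for p in probs) / len(probs)


if __name__ == "__main__":
    for k in (1, 4, 16, 64, 256, 1024):
        b = expected_pass_at_k(base_p, k)
        r = expected_pass_at_k(rlvr_p, k)
        print(f"k={k:5d}  base={b:.3f}  rlvr={r:.3f}")
```

Under these assumed numbers the RLVR curve starts higher and then flattens, while the base curve keeps rising with k and eventually overtakes it, which is the shape the core claim describes.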

What carries the argument

Large-k pass@k evaluation together with coverage analysis to determine whether any reasoning pattern lies outside the base model's sampling distribution.
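For readers unfamiliar with the metric, pass@k is commonly estimated from n samples per problem, of which c are correct, as 1 - C(n-c, k) / C(n, k). The sketch below implements that standard unbiased estimator; whether the paper uses exactly this estimator and which n it draws are assumptions here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the n samples contains a correct one.
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(counts, n: int, k: int) -> float:
    """Average the per-problem estimates; counts[i] is c for problem i."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)
```

A benchmark-level score is the mean of the per-problem estimates, so a problem with even one correct sample among many contributes to large-k pass@k.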

If this is right

RLVR training does not expand the set of problems an LLM can solve beyond those solvable by its base model.
Six common RLVR algorithms deliver comparable performance and none fully exploits the base model's latent capabilities.
Distillation from a stronger teacher model can add reasoning patterns absent from the base model.
Current RLVR methods fall short of the self-improvement that reinforcement learning is expected to provide for reasoning tasks.
Paradigms such as continual scaling or multi-turn agent interaction may be needed to move past the base-model bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Purely sampling-based techniques that avoid RL training altogether could match or exceed RLVR gains on pass@1 without any parameter updates.
The apparent progress from RLVR on standard benchmarks may largely reflect better exploitation of existing knowledge rather than capability growth.
Hybrid methods that pair RL with explicit mechanisms to surface low-probability base-model outputs could test whether the current bound is fundamental.

Load-bearing premise

That any reasoning pattern never produced by the base model even after extremely large numbers of samples is genuinely unavailable rather than simply too rare to observe.

What would settle it

An RLVR model correctly solving a problem instance that the matched base model fails to solve after more than one million independent samples would contradict the claim that no new patterns are introduced.
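How much a run of failed samples can rule out is a matter of simple arithmetic. A minimal sketch, assuming independent samples with a fixed per-sample success probability p: the chance of seeing no success in k draws is (1 - p)^k, and observing zero successes in k draws bounds p from above at a chosen confidence level. The function names and the 95% level are illustrative choices, not from the paper.

```python
import math


def miss_probability(p: float, k: int) -> float:
    """P(no correct sample in k independent draws) when each succeeds with prob p."""
    return (1.0 - p) ** k


def upper_bound_on_p(k: int, alpha: float = 0.05) -> float:
    """Largest p still consistent, at level alpha, with zero successes in k draws.

    Solves (1 - p)^k = alpha for p; roughly 3 / k when alpha = 0.05.
    """
    return -math.expm1(math.log(alpha) / k)


if __name__ == "__main__":
    k = 1_000_000
    print(miss_probability(1e-6, k))  # a one-in-a-million strategy is missed fairly often
    print(upper_bound_on_p(k))        # what k failures can rule out at 95% confidence
```

Under these assumptions, even a million failed samples only bounds the per-sample success probability at roughly three in a million, so "never observed" and "unavailable" remain distinct claims.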

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Base models reach higher pass@k at large k than their RLVR versions, which suggests current RL mostly improves sampling efficiency instead of adding new reasoning patterns.


The paper's main result is that RLVR fine-tunes beat their base models at k=1 but lose to them once k grows large. Coverage and perplexity checks point to the same conclusion: the RL models are not surfacing reasoning steps absent from the base distribution. Distillation, by contrast, does expand the set of reachable solutions.

The authors show this pattern across several model families, six RL algorithms, and math, coding, and visual benchmarks. That consistency is the strongest part of the work. It gives a clear empirical counter to the common claim that RLVR enables unbounded self-improvement beyond the base model. The quantitative comparison of how far each algorithm falls short of the base-model upper bound is also useful for anyone trying to measure progress in this area.

The central assumption, however, is that large-k sampling from the base model is exhaustive. If RLVR simply raises the probability of strategies that sit near zero in the base model, even a large k may still miss them. The abstract does not report the exact k values used, the temperature settings, the number of independent samples per problem, or per-problem verification that every RL-correct answer appears in the base samples. Without those controls it is hard to know how much weight to give the “no new patterns” claim. Statistical tests on the pass@k gaps are also not mentioned.

This work is aimed at researchers who build or scale RLVR systems for reasoning. Anyone who assumes RL will keep producing gains without new paradigms will find the numbers worth checking. The question is timely and the empirical approach is direct, so the paper deserves a serious referee. I would send it out, with the expectation that the authors add the missing sampling details and any statistical checks before final acceptance.

Referee Report

2 major / 3 minor

Summary. The paper empirically investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) elicits novel reasoning patterns in LLMs beyond those present in base models. Across multiple model families, six RL algorithms, and benchmarks in math, coding, and visual reasoning, the authors report that RLVR models outperform base models at small k (e.g., k=1) but underperform at large k on pass@k. Coverage and perplexity analyses are used to argue that reasoning abilities originate from and are bounded by the base model distribution. The work contrasts this with distillation, which does expand capabilities, and concludes that current RLVR remains far from optimal in leveraging base-model potential, calling for new paradigms such as continual scaling or multi-turn interactions.

Significance. If the central empirical pattern holds, the result would meaningfully temper claims that RLVR enables self-improvement and discovery of new reasoning strategies in LLMs, instead framing observed gains as reweighting of base-model capabilities. This has clear implications for research on scaling reasoning models. The systematic scope—spanning model families, algorithms, and task types—provides a useful broad empirical baseline, and the explicit comparison to distillation supplies a constructive contrast that highlights where capability expansion does occur.

major comments (2)

[coverage and perplexity analyses] The load-bearing claim that 'the observed reasoning abilities originate from and are bounded by the base model' (abstract and coverage/perplexity section) rests on large-k pass@k serving as an exhaustive upper bound. To substantiate that RLVR does not elicit new patterns, the analysis must verify that the specific solutions produced by RLVR models appear among base-model samples under identical temperature and decoding settings; higher aggregate pass@k alone does not rule out the possibility that RLVR shifts mass onto low-probability strategies that large-k sampling simply fails to surface. A per-problem overlap metric or explicit recovery check would directly address this.
[experimental setup and results] The manuscript reports consistent patterns across six RL algorithms but provides no details on statistical significance testing, exact data splits, or whether large-k sampling used identical temperature/decoding settings for base and RLVR models (as noted in the evaluation protocol). These omissions make it difficult to assess whether the reported pass@k gaps are robust or sensitive to sampling variance.
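The per-problem overlap metric requested in the first major comment could be sketched as follows. This is a coarse, problem-level proxy (was each problem solved at least once by each model?) rather than a solution-level match, and the input layout, a dict from a hypothetical problem ID to a list of per-sample correctness flags, is an assumption made for illustration.

```python
def coverage_overlap(base_correct, rlvr_correct):
    """Compare which problems each model solves at least once.

    Both arguments map problem_id -> list of booleans, one per sample.
    """
    base_solved = {pid for pid, flags in base_correct.items() if any(flags)}
    rlvr_solved = {pid for pid, flags in rlvr_correct.items() if any(flags)}
    recovered = rlvr_solved & base_solved
    rate = len(recovered) / len(rlvr_solved) if rlvr_solved else float("nan")
    return {
        "rlvr_solved": len(rlvr_solved),
        "recovered_by_base": len(recovered),
        "recovery_rate": rate,
        # Problems here would be candidates for patterns outside the base samples.
        "rlvr_only": sorted(rlvr_solved - base_solved),
        "base_only": sorted(base_solved - rlvr_solved),
    }
```

A non-empty `rlvr_only` list would not by itself show a new reasoning pattern, but it would identify the specific problems on which deeper base-model sampling is worth spending.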

minor comments (3)

[evaluation metrics] Clarify the precise values of 'large k' employed in the pass@k curves and state the number of independent samples drawn per problem.
[introduction and abstract] The abstract states that 'six popular RLVR algorithms perform similarly'; the main text should list these algorithms explicitly with citations.
[figures] Figure captions and axis labels should indicate temperature, top-p, and whether greedy or stochastic decoding was used for the k=1 results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our empirical claims. We address each major point below and commit to revisions that strengthen the substantiation of our results without altering the core findings.


Referee: [coverage and perplexity analyses] The load-bearing claim that 'the observed reasoning abilities originate from and are bounded by the base model' (abstract and coverage/perplexity section) rests on large-k pass@k serving as an exhaustive upper bound. To substantiate that RLVR does not elicit new patterns, the analysis must verify that the specific solutions produced by RLVR models appear among base-model samples under identical temperature and decoding settings; higher aggregate pass@k alone does not rule out the possibility that RLVR shifts mass onto low-probability strategies that large-k sampling simply fails to surface. A per-problem overlap metric or explicit recovery check would directly address this.

Authors: We agree that an explicit per-problem overlap or recovery analysis would provide stronger direct evidence that RLVR solutions lie within the base model's support. While our large-k pass@k results, combined with coverage and perplexity analyses, already indicate that RLVR primarily reweights existing patterns rather than introducing new ones, we will add a recovery check in the revised manuscript. Specifically, we will sample solutions from the base model under identical temperature and decoding settings and report the fraction of RLVR-generated correct solutions that are recoverable in the base model's samples on a per-problem basis. This will directly address the concern about low-probability strategies.

revision: yes

Referee: [experimental setup and results] The manuscript reports consistent patterns across six RL algorithms but provides no details on statistical significance testing, exact data splits, or whether large-k sampling used identical temperature/decoding settings for base and RLVR models (as noted in the evaluation protocol). These omissions make it difficult to assess whether the reported pass@k gaps are robust or sensitive to sampling variance.

Authors: We appreciate this feedback on clarity. The evaluation protocol (Section 3.2) already specifies identical sampling parameters (temperature 0.7, top-p 0.95) for all models, but we will explicitly restate this equivalence for base and RLVR models in the revised text. We will also add details on the standard benchmark test splits used and include statistical significance measures (e.g., standard errors across multiple sampling runs or p-values for key pass@k differences) to demonstrate robustness. These additions will be incorporated without requiring new experiments.

revision: yes
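One way the promised significance measures could be computed is a percentile bootstrap over problems for the base-minus-RLVR pass@k gap. The sketch below is an assumption about how such a check might look, not the authors' procedure: it resamples problems rather than sampling runs, uses the standard unbiased pass@k estimator, and all names and defaults are illustrative.

```python
import random
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_gap(base_counts, rlvr_counts, n, k, n_boot=2000, seed=0):
    """Point estimate and 95% percentile interval for mean(base - RLVR) pass@k.

    base_counts[i] and rlvr_counts[i] are the numbers of correct samples
    (out of n) on problem i for the base and RLVR models respectively.
    """
    rng = random.Random(seed)
    diffs = [pass_at_k(n, b, k) - pass_at_k(n, r, k)
             for b, r in zip(base_counts, rlvr_counts)]
    m = len(diffs)
    point = sum(diffs) / m
    means = []
    for _ in range(n_boot):
        # Resample problems with replacement and recompute the mean gap.
        resample = [diffs[rng.randrange(m)] for _ in range(m)]
        means.append(sum(resample) / m)
    means.sort()
    lo = means[int(0.025 * n_boot)]
    hi = means[int(0.975 * n_boot) - 1]
    return point, lo, hi
```

An interval that excludes zero at large k would support the claim that the base model's advantage is not an artifact of sampling variance. With few benchmark problems the interval will be wide, which is itself worth reporting.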

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivation chain


The paper reports direct experimental comparisons of pass@k (small and large k) between RLVR models and base models across benchmarks, supplemented by coverage and perplexity measurements. The central claim that reasoning abilities originate from and are bounded by the base model follows from these observed scores rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or first-principles steps are present that could reduce to inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical measurement study. It relies on standard definitions of pass@k and perplexity (standard_math) and the domain assumption that large-k coverage approximates the model's full reasoning support. No free parameters are fitted to produce the central claim and no new entities are postulated.
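The standard definition of perplexity referred to here is the exponential of the negative mean token log-probability. A minimal sketch follows, assuming the per-token natural-log probabilities of a response have already been obtained from some model; how the paper pairs responses with scoring models is not specified in this review, so the usage described afterwards is only one plausible setup.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean log p) over the tokens of one response (natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, scoring RLVR-generated responses under the base model and finding low perplexity would be consistent with those responses already lying in high-probability regions of the base distribution.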

axioms (1)

domain assumption: pass@k at sufficiently large k measures the model's total reasoning capacity
Invoked when treating base-model large-k performance as an upper bound on RLVR capability.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
GIANTS: Generative Insight Anticipation from Scientific Literature
cs.CL 2026-04 unverdicted novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
Evaluating Large Language Models in Scientific Discovery
cs.AI 2025-12 unverdicted novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
cs.CR 2025-09 conditional novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
cs.CR 2025-07 unverdicted novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Spurious Rewards: Rethinking Training Signals in RLVR
cs.AI 2025-06 accept novelty 8.0

Spurious rewards in RLVR can produce large gains in mathematical reasoning for certain language models via GRPO's clipping bias amplifying pretraining behaviors like code reasoning.
Residual Skill Optimization for Text-to-SQL Ensembles
cs.CL 2026-05 unverdicted novelty 7.0

Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.
Finite-Time Regret Analysis of Retry-Aware Bandits
cs.LG 2026-05 unverdicted novelty 7.0

ReMax achieves the first sublinear finite-time regret bound for Gaussian bandits with M=2 by deriving an expected-improvement balance condition for its optimal sampling distribution and separating saturation from unde...
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
cs.LG 2026-05 unverdicted novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
cs.AI 2026-05 conditional novelty 7.0

The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
cs.LG 2026-05 unverdicted novelty 7.0

HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 7.0

RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Near-Future Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
cs.CV 2026-04 unverdicted novelty 7.0

Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench f...
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
cs.AI 2026-04 unverdicted novelty 7.0

GFT uses group advantage learning and dynamic coefficient rectification to fix reward sparsity and optimization instability in SFT for LLMs, yielding better policies than standard SFT.
Skill-Conditioned Visual Geolocation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
Skill-Conditioned Visual Geolocation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
cs.AI 2026-04 unverdicted novelty 7.0

SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
cs.LG 2026-01 unverdicted novelty 7.0

Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.
HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
cs.CV 2025-09 unverdicted novelty 7.0

HiDe is a training-free hierarchical decoupling method that separates key visual tokens from background interference in high-resolution MLLMs to achieve new state-of-the-art results on V*Bench, HRBench4K, and HRBench8K.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
cs.LG 2025-07 unverdicted novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
GRIT: Teaching MLLMs to Think with Images
cs.CV 2025-05 unverdicted novelty 7.0

GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
cs.LG 2025-04 accept novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
cs.LG 2026-05 conditional novelty 6.0

VPO modifies the GRPO advantage estimator to train LLMs for diversity across vector reward trade-offs, matching or exceeding scalar RL baselines on test-time search with larger gains at higher search budgets.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math ...
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on...
SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
cs.LG 2026-05 unverdicted novelty 6.0

SAGE reshapes the reverse-KL anchor via guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
cs.LG 2026-05 conditional novelty 6.0

Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
cs.LG 2026-05 unverdicted novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
cs.AI 2026-05 unverdicted novelty 6.0

ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 6.0

RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
cs.AI 2026-05 unverdicted novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
cs.CR 2026-05 unverdicted novelty 6.0

Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG 2026-04 unverdicted novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG 2026-04 unverdicted novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
LEPO: Latent Reasoning Policy Optimization for Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
Generalization in LLM Problem Solving: The Case of the Shortest Path
cs.AI 2026-04 unverdicted novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.