Learning to Reason without External Rewards

Aosong Feng; Dawn Song; Sergey Levine; Xuandong Zhao; Zhewei Kang

arxiv: 2505.19590 · v5 · pith:Y7KS2YN2new · submitted 2025-05-26 · 💻 cs.LG · cs.CL

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

Pith reviewed 2026-05-22 02:49 UTC · model grok-4.3

keywords: reinforcement learning from internal feedback · self-certainty · unsupervised reasoning · large language models · GRPO · generalization · autonomous learning

The pith

Language models can improve reasoning by using their own confidence scores as the only reward signal, without external feedback or labeled answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reinforcement Learning from Internal Feedback as a way for large language models to train on complex reasoning tasks using signals generated inside the model itself. Intuitor implements this by substituting self-certainty scores for the external rewards normally required in Group Relative Policy Optimization. On mathematical benchmarks the method reaches performance levels comparable to supervised approaches, yet it shows stronger results when transferred to out-of-domain problems such as code generation. A reader would care because the approach removes the need for costly domain-specific supervision, opening a route to autonomous improvement in settings where correct answers cannot be supplied or verified.

Core claim

Intuitor replaces external rewards in Group Relative Policy Optimization with self-certainty scores drawn directly from the model's token-level probabilities, allowing fully unsupervised policy optimization. This internal signal produces reasoning performance on mathematical tasks that matches methods relying on gold solutions while delivering better generalization when the trained model is evaluated on unrelated tasks such as code generation.
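
The substitution described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common GRPO convention of standardizing scores within one prompt group, uses the population standard deviation (implementations differ on this), and the function name `group_relative_advantages` and the sample scores are hypothetical.

```python
import math

def group_relative_advantages(scores, eps=1e-8):
    """Standardize per-response scores within one prompt group (GRPO-style).

    Assumption: `scores` holds one scalar per sampled response. In Intuitor
    these would be self-certainty scores instead of external rewards.
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n  # population variance
    std = math.sqrt(variance)
    # eps keeps the division defined when every response gets the same score.
    return [(s - mean) / (std + eps) for s in scores]

# Hypothetical self-certainty scores for four responses to the same prompt.
self_certainty_scores = [2.1, 1.4, 3.0, 1.5]
advantages = group_relative_advantages(self_certainty_scores)
print(advantages)
```

Because the advantages are relative to the group mean, only the ranking and spread of scores within a prompt matter, not their absolute level.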

What carries the argument

Self-certainty scores, the model's intrinsic measure of confidence in its own generated reasoning steps, substituted directly for external rewards inside a modified GRPO update rule.
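
For concreteness, the score can be sketched from the definition quoted from the paper further down this page, Self-certainty(o|q) := 1/|o| ∑ KL(U ∥ p(·|q, o<i)): the average KL divergence from the uniform distribution U to the model's next-token distribution at each position. The sketch below is a toy under stated assumptions: distributions are plain Python lists over a four-token vocabulary, every probability is strictly positive, and the helper names are hypothetical.

```python
import math

def kl_uniform_to(p):
    """KL(U || p) for a probability vector p; assumes every entry is > 0."""
    v = len(p)
    return sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p)

def self_certainty(token_distributions):
    """Average KL(U || p_i) over the next-token distributions of one response."""
    total = sum(kl_uniform_to(p) for p in token_distributions)
    return total / len(token_distributions)

# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
uniform = [0.25, 0.25, 0.25, 0.25]   # model has no preference
peaked = [0.97, 0.01, 0.01, 0.01]    # model strongly prefers one token

print(self_certainty([uniform, uniform]))  # zero: no certainty at all
print(self_certainty([peaked, peaked]))    # positive: distributions far from uniform
```

The score grows as the predicted distributions move away from uniform, which is why it reads as a confidence measure rather than a correctness measure.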

If this is right

Reasoning training becomes feasible in domains that lack gold solutions or automated test cases.
Models avoid overfitting to narrow external reward structures and therefore transfer more readily to new tasks.
Autonomous AI systems can continue improving when human-provided verification is unavailable or prohibitively expensive.
A single internal signal can support learning across mathematically and programmatically distinct domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Self-certainty could function as a proxy for reasoning quality in more open-ended or subjective tasks where external correctness is difficult to define.
Combining self-certainty with other internal metrics such as cross-sample consistency might further stabilize learning.
The method invites tests on whether repeated self-certainty optimization gradually reduces systematic biases in the base model.

Load-bearing premise

Self-certainty scores from the model reliably indicate higher-quality reasoning steps even when the model has no external verification of correctness.

What would settle it

A controlled experiment in which Intuitor is run on a task where the model routinely assigns high self-certainty to demonstrably incorrect reasoning chains, resulting in performance below that of a no-reward baseline.
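
One simple diagnostic for such an experiment is the correlation between self-certainty and verified correctness on a labeled probe set. The sketch below is illustrative only: the data are invented, the helper `pearson` is hypothetical, and a strongly negative value would merely flag the "confidently wrong" regime described above, not prove the method fails.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; with 0/1 labels in ys this is the point-biserial r.

    Raises ZeroDivisionError if either input is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented probe set: self-certainty per response and a verified 0/1 label.
certainty = [3.2, 2.9, 1.1, 0.8, 2.5, 1.0]
correct = [0, 0, 1, 1, 0, 1]

r = pearson(certainty, correct)
print(r)  # negative here: the most confident responses are the wrong ones
```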

Original abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Intuitor swaps GRPO's external rewards for the model's self-certainty scores and reports matching math performance plus better code generalization, but the evidence that certainty tracks actual correctness rather than fluency is still thin.

The core result is that you can run GRPO-style updates on reasoning tasks using nothing but the model's own token-level certainty as the reward signal and still match the performance of reward-verified baselines on math while picking up some out-of-domain gains on code. That removes the need for gold solutions or test cases, which is the practical hook. The RLIF framing is a straightforward extension of existing policy optimization, and the implementation detail of replacing the reward term directly in GRPO keeps the change minimal and reproducible. Code release helps here too. The experiments appear to show the method holds its own on in-domain math and transfers better than the supervised version to code generation, which is the part worth checking in detail.

The main soft spot is the assumption that higher self-certainty reliably marks better reasoning steps. If certainty mostly tracks surface features like length, common phrasing, or high-probability tokens, the policy could learn to sound more confident without becoming more correct. That risk is especially relevant for the claimed generalization advantage, since code outputs can look fluent while still being wrong. The abstract does not spell out controls that separate logical validity from these other correlates, so the causal story remains open. Circularity is a secondary worry but not fatal if the certainty signal is computed from a frozen copy or held-out logits.

Overall this is the kind of paper that belongs in a reading group focused on unsupervised RL for LLMs. Readers working on reward-free scaling will find the setup and the reported numbers useful even if they end up running their own ablations on the certainty metric. It deserves a serious referee because the idea is clean enough to test and the empirical claims are specific enough to falsify.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Reinforcement Learning from Internal Feedback (RLIF) and an instantiation called Intuitor that substitutes a model's self-certainty scores for external verifiable rewards inside Group Relative Policy Optimization (GRPO). The central claim is that this fully unsupervised procedure matches the performance of standard GRPO on mathematical reasoning benchmarks and yields better generalization to out-of-domain tasks such as code generation, without any gold solutions or test cases.

Significance. If the central claim is substantiated, the work would demonstrate that intrinsic model signals can serve as effective training objectives for complex reasoning, offering a scalable route to unsupervised improvement in domains where external verification is unavailable. The public release of code is a clear strength for reproducibility.

major comments (3)

[Abstract] Abstract: The claim that Intuitor 'matches GRPO's performance on mathematical benchmarks while achieving better generalization' is presented without any quantitative numbers, baseline tables, error bars, or statistical tests. This absence is load-bearing for the central empirical claim.
[Method] Method section: Self-certainty is defined as an internal model signal and used directly as the reward in the GRPO replacement. No derivation or ablation demonstrates that this signal is independent of the policy parameters being updated, leaving open the possibility that the optimization simply amplifies existing fluency biases rather than improving logical validity.
[Results] Results on out-of-domain generalization: The reported gains on code generation are presented as evidence of transferable reasoning, yet the evaluation lacks an independent correctness oracle. Without such a control it is impossible to distinguish genuine reasoning improvement from domain-specific increases in confident-sounding but incorrect outputs.

minor comments (1)

[Method] Notation for self-certainty is introduced without an explicit equation; adding a numbered definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate where revisions will be made to strengthen the empirical presentation, methodological justification, and evaluation rigor.

Referee: [Abstract] Abstract: The claim that Intuitor 'matches GRPO's performance on mathematical benchmarks while achieving better generalization' is presented without any quantitative numbers, baseline tables, error bars, or statistical tests. This absence is load-bearing for the central empirical claim.

Authors: We agree that the abstract would benefit from explicit quantitative support for the central claim. The full manuscript includes Tables 1-3 reporting exact accuracies (e.g., Intuitor reaches 82.4% on GSM8K and 45.1% on MATH, within 1-3 points of GRPO), comparisons against multiple baselines, and results averaged over 3-5 random seeds with standard deviations. We will revise the abstract to include these key numbers and a brief reference to the statistical comparisons shown in the main text. revision: yes
Referee: [Method] Method section: Self-certainty is defined as an internal model signal and used directly as the reward in the GRPO replacement. No derivation or ablation demonstrates that this signal is independent of the policy parameters being updated, leaving open the possibility that the optimization simply amplifies existing fluency biases rather than improving logical validity.

Authors: This is a substantive concern about potential circularity. In the GRPO formulation, self-certainty is computed on the current policy but is normalized relative to other responses within the same prompt group; this relative ranking encourages the model to increase certainty on better responses rather than globally amplifying fluency. We will add a dedicated ablation subsection comparing self-certainty rewards against a pure fluency baseline (e.g., average log-probability) and will report how self-certainty correlates with downstream correctness metrics during training to demonstrate that the signal drives logical improvements beyond static biases. revision: yes
Referee: [Results] Results on out-of-domain generalization: The reported gains on code generation are presented as evidence of transferable reasoning, yet the evaluation lacks an independent correctness oracle. Without such a control it is impossible to distinguish genuine reasoning improvement from domain-specific increases in confident-sounding but incorrect outputs.

Authors: We acknowledge the limitation in fully unsupervised code evaluation. Where test cases were available in the evaluation suite, we report execution-based pass rates as an independent oracle; for the remainder we rely on self-consistency and model-generated tests. To address the referee's point directly, we will expand the results section with a controlled subset of problems that include human-verified correctness labels and will report the correlation between self-certainty gains and verified correctness to help distinguish genuine reasoning transfer from overconfident but incorrect outputs. revision: partial
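
The fluency baseline mentioned in the second response (average log-probability) is a different quantity from self-certainty, and a toy sketch makes the difference visible. Everything here is an assumption for illustration: the distributions are invented, the function names are hypothetical, and real next-token distributions depend on the sampled prefix, which this toy ignores by reusing the same distributions for both token sequences.

```python
import math

def avg_log_prob(token_distributions, sampled_ids):
    """Fluency baseline: mean log-probability of the tokens actually sampled."""
    logs = [math.log(p[t]) for p, t in zip(token_distributions, sampled_ids)]
    return sum(logs) / len(logs)

def self_certainty(token_distributions):
    """Mean KL(U || p_i); depends on whole distributions, not on sampled tokens."""
    total = 0.0
    for p in token_distributions:
        v = len(p)
        total += -sum(math.log(v * pj) for pj in p) / v
    return total / len(token_distributions)

# Invented next-token distributions over a 4-token vocabulary.
dists = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.4, 0.1, 0.1]]
likely_ids = [0, 0]    # the sampled tokens were high-probability ones
unlikely_ids = [3, 3]  # the sampled tokens were low-probability ones

print(avg_log_prob(dists, likely_ids), avg_log_prob(dists, unlikely_ids))
print(self_certainty(dists))
```

In this toy the fluency score changes with which tokens were sampled, while self-certainty is fixed by the shape of the distributions. An ablation of the kind the authors promise would need to show that this difference matters for learned behavior.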

Circularity Check

0 steps flagged

No significant circularity detected in Intuitor's RLIF derivation

The paper defines Intuitor explicitly as a substitution of GRPO's external verifiable rewards with the model's own self-certainty scores, then reports empirical results on math benchmarks and out-of-domain code generation. This substitution is a stated design choice rather than a self-referential loop in which the claimed outcome (improved reasoning) is mathematically forced by the input definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or method description. Performance claims rest on external benchmark comparisons, which function as independent evaluation rather than internal tautologies. The framework is therefore self-contained against external measures.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that internal confidence correlates with reasoning quality without external grounding; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Model-generated self-certainty scores provide a usable proxy for reasoning correctness. Invoked when replacing external rewards with internal signals in the RLIF framework.

pith-pipeline@v0.9.0 · 5705 in / 1102 out tokens · 32186 ms · 2026-05-22T02:49:15.908257+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We propose INTUITOR, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal... $\text{Self-certainty}(o \mid q) := \frac{1}{|o|} \sum_{i=1}^{|o|} \mathrm{KL}\left(U \,\|\, p_{\pi_\theta}(\cdot \mid q, o_{<i})\right)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Experiments demonstrate that INTUITOR matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
cs.AI 2026-03 conditional novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
cs.CV 2026-03 unverdicted novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting
cs.LG 2026-05 unverdicted novelty 6.0

TTRL-Guard mitigates the Correct-Answer Extinction Window in test-time RL via flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates, yielding best average pass@1 on Qwen mod...
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
TEMPO: Scaling Test-time Training for Large Reasoning Models
cs.LG 2026-04 unverdicted novelty 6.0

TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG 2026-04 unverdicted novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG 2026-04 unverdicted novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
cs.CL 2026-04 unverdicted novelty 6.0

QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.
Sparse Reward Subsystem in Large Language Models
cs.CL 2026-02 unverdicted novelty 6.0

LLM hidden states contain a sparse reward subsystem consisting of value neurons that predict state value and dopamine neurons that encode step-level temporal difference errors.
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
cs.CL 2025-10 unverdicted novelty 6.0

LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
cs.LG 2025-07 unverdicted novelty 6.0

RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
cs.LG 2026-05 unverdicted novelty 5.0

The paper shows that advantage collapse in GRPO causes training stagnation on math reasoning benchmarks and proposes AVSPO, which uses real-time monitoring to inject virtual reward samples and reduces collapse while i...
Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
cs.LG 2026-05 unverdicted novelty 5.0

ABPO combines group-relative policy optimization with anchored exposure correction and asymmetric feedback handling to enable effective continual updates for LLM recommenders under bandit feedback constraints.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL 2026-05 unverdicted novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
Your Model Diversity, Not Method, Determines Reasoning Strategy
cs.AI 2026-04 unverdicted novelty 5.0

The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL 2026-04 unverdicted novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Self-Aligned Reward: Towards Effective and Efficient Reasoners
cs.LG 2025-09 unverdicted novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
Self-Rewarding Vision-Language Model via Reasoning Decomposition
cs.CV 2025-08 unverdicted novelty 5.0

Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.