Learning to Reason without External Rewards
Pith reviewed 2026-05-22 02:49 UTC · model grok-4.3
The pith
Language models can improve reasoning by using their own confidence scores as the only reward signal, without external feedback or labeled answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Intuitor replaces external rewards in Group Relative Policy Optimization with self-certainty scores drawn directly from the model's token-level probabilities, allowing fully unsupervised policy optimization. This internal signal produces reasoning performance on mathematical tasks that matches methods relying on gold solutions while delivering better generalization when the trained model is evaluated on unrelated tasks such as code generation.
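As an illustration of the substitution described above, the following minimal sketch standardizes a group of per-response scores into GRPO-style advantages. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example scores are all assumptions made for illustration; they are not taken from the paper or its code release.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt group (GRPO-style sketch).

    `rewards` holds one scalar per sampled response to the same prompt.
    Here those scalars would be self-certainty scores instead of
    externally verified rewards. Population std is an assumption;
    implementations may use the sample std instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical self-certainty scores for four responses to one prompt.
scores = [2.0, 4.0, 6.0, 8.0]
advantages = group_relative_advantages(scores)
```

Because only each score's position relative to its own group matters, a group in which every response receives the same score yields zero advantage for all of them.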
What carries the argument
Self-certainty scores, the model's intrinsic measure of confidence in its own generated reasoning steps, substituted directly for external rewards inside a modified GRPO update rule.
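The paper defines self-certainty as the average, over generated tokens, of KL(U ∥ p), where U is the uniform distribution over the vocabulary and p is the model's next-token distribution. A minimal pure-Python sketch of that quantity follows. The function names and the toy four-token vocabulary are assumptions for illustration, and real training code would presumably operate on batched tensors rather than Python lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_uniform_to_p(logits):
    """KL(U || p) for one next-token distribution given its logits.

    With vocabulary size V this equals -log(V) - mean_j log p_j.
    """
    v = len(logits)
    log_p = log_softmax(logits)
    return -math.log(v) - sum(log_p) / v


def self_certainty(step_logits):
    """Average KL(U || p) over the generated positions of one response."""
    return sum(kl_uniform_to_p(l) for l in step_logits) / len(step_logits)


# Toy example (assumed data): a flat distribution versus a peaked one.
flat = self_certainty([[0.0, 0.0, 0.0, 0.0]])
peaked = self_certainty([[5.0, 0.0, 0.0, 0.0]])
```

A flat next-token distribution gives a score of zero, and the score grows as the distribution concentrates on fewer tokens.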
If this is right
- Reasoning training becomes feasible in domains that lack gold solutions or automated test cases.
- Models avoid overfitting to narrow external reward structures and therefore transfer more readily to new tasks.
- Autonomous AI systems can continue improving when human-provided verification is unavailable or prohibitively expensive.
- A single internal signal can support learning across mathematically and programmatically distinct domains.
Where Pith is reading between the lines
- Self-certainty could function as a proxy for reasoning quality in more open-ended or subjective tasks where external correctness is difficult to define.
- Combining self-certainty with other internal metrics such as cross-sample consistency might further stabilize learning.
- The method invites tests on whether repeated self-certainty optimization gradually reduces systematic biases in the base model.
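One possible reading of "cross-sample consistency" in the list above is the share of sampled final answers that agree with the most common one. The sketch below implements only that reading. The function name and the example answers are hypothetical, and nothing here is taken from the paper.

```python
from collections import Counter


def cross_sample_consistency(final_answers):
    """Fraction of sampled answers that match the most common answer."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(final_answers)


# Hypothetical final answers extracted from four samples of one prompt.
consistency = cross_sample_consistency(["42", "42", "41", "42"])  # 0.75
```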
Load-bearing premise
Self-certainty scores from the model reliably indicate higher-quality reasoning steps even when the model has no external verification of correctness.
What would settle it
A controlled experiment in which Intuitor is run on a task where the model routinely assigns high self-certainty to demonstrably incorrect reasoning chains, resulting in performance below that of a no-reward baseline.
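One way to check whether a task falls into the failure regime described above is to measure how well self-certainty ranks correct responses above incorrect ones on a labeled probe set. The sketch below computes a pairwise ranking score (equivalent to AUROC). The function name and the toy data are assumptions, and the correctness labels would have to come from an external checker that the training loop itself never sees.

```python
def certainty_auroc(scores, is_correct):
    """Probability that a correct response outscores an incorrect one.

    1.0 means self-certainty ranks every correct response higher,
    0.5 is chance level, and values below 0.5 mean the model is more
    certain about its wrong answers than its right ones.
    """
    pos = [s for s, c in zip(scores, is_correct) if c]
    neg = [s for s, c in zip(scores, is_correct) if not c]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect responses")
    # Ties between a correct and an incorrect response count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy adversarial case (assumed data): the wrong answers score highest.
adversarial = certainty_auroc([0.9, 0.8, 0.3, 0.2],
                              [False, False, True, True])  # 0.0
```

A value well below 0.5 on such a probe set would indicate the kind of task on which the proposed experiment could be run.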
Original abstract
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Reinforcement Learning from Internal Feedback (RLIF) and an instantiation called Intuitor that substitutes a model's self-certainty scores for external verifiable rewards inside Group Relative Policy Optimization (GRPO). The central claim is that this fully unsupervised procedure matches the performance of standard GRPO on mathematical reasoning benchmarks and yields better generalization to out-of-domain tasks such as code generation, without any gold solutions or test cases.
Significance. If the central claim is substantiated, the work would demonstrate that intrinsic model signals can serve as effective training objectives for complex reasoning, offering a scalable route to unsupervised improvement in domains where external verification is unavailable. The public release of code is a clear strength for reproducibility.
major comments (3)
- [Abstract] Abstract: The claim that Intuitor 'matches GRPO's performance on mathematical benchmarks while achieving better generalization' is presented without any quantitative numbers, baseline tables, error bars, or statistical tests. This absence is load-bearing for the central empirical claim.
- [Method] Method section: Self-certainty is defined as an internal model signal and used directly as the reward in the GRPO replacement. No derivation or ablation demonstrates that this signal is independent of the policy parameters being updated, leaving open the possibility that the optimization simply amplifies existing fluency biases rather than improving logical validity.
- [Results] Results on out-of-domain generalization: The reported gains on code generation are presented as evidence of transferable reasoning, yet the evaluation lacks an independent correctness oracle. Without such a control it is impossible to distinguish genuine reasoning improvement from domain-specific increases in confident-sounding but incorrect outputs.
minor comments (1)
- [Method] Notation for self-certainty is introduced without an explicit equation; adding a numbered definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate where revisions will be made to strengthen the empirical presentation, methodological justification, and evaluation rigor.
-
Referee: [Abstract] Abstract: The claim that Intuitor 'matches GRPO's performance on mathematical benchmarks while achieving better generalization' is presented without any quantitative numbers, baseline tables, error bars, or statistical tests. This absence is load-bearing for the central empirical claim.
Authors: We agree that the abstract would benefit from explicit quantitative support for the central claim. The full manuscript includes Tables 1-3 reporting exact accuracies (e.g., Intuitor reaches 82.4% on GSM8K and 45.1% on MATH, within 1-3 points of GRPO), comparisons against multiple baselines, and results averaged over 3-5 random seeds with standard deviations. We will revise the abstract to include these key numbers and a brief reference to the statistical comparisons shown in the main text. revision: yes
-
Referee: [Method] Method section: Self-certainty is defined as an internal model signal and used directly as the reward in the GRPO replacement. No derivation or ablation demonstrates that this signal is independent of the policy parameters being updated, leaving open the possibility that the optimization simply amplifies existing fluency biases rather than improving logical validity.
Authors: This is a substantive concern about potential circularity. In the GRPO formulation, self-certainty is computed on the current policy but is normalized relative to other responses within the same prompt group; this relative ranking encourages the model to increase certainty on better responses rather than globally amplifying fluency. We will add a dedicated ablation subsection comparing self-certainty rewards against a pure fluency baseline (e.g., average log-probability) and will report how self-certainty correlates with downstream correctness metrics during training to demonstrate that the signal drives logical improvements beyond static biases. revision: yes
-
Referee: [Results] Results on out-of-domain generalization: The reported gains on code generation are presented as evidence of transferable reasoning, yet the evaluation lacks an independent correctness oracle. Without such a control it is impossible to distinguish genuine reasoning improvement from domain-specific increases in confident-sounding but incorrect outputs.
Authors: We acknowledge the limitation in fully unsupervised code evaluation. Where test cases were available in the evaluation suite, we report execution-based pass rates as an independent oracle; for the remainder we rely on self-consistency and model-generated tests. To address the referee's point directly, we will expand the results section with a controlled subset of problems that include human-verified correctness labels and will report the correlation between self-certainty gains and verified correctness to help distinguish genuine reasoning transfer from overconfident but incorrect outputs. revision: partial
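The fluency baseline named in the second response above (average log-probability) can be sketched as follows. Function names and toy logits are assumptions for illustration. The point of the sketch is that this baseline scores only the token that was actually sampled at each position, whereas each per-position term of self-certainty depends on the whole next-token distribution.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def average_log_prob(step_logits, token_ids):
    """Fluency baseline: mean log-probability of the sampled tokens."""
    if len(step_logits) != len(token_ids) or not token_ids:
        raise ValueError("need one token id per non-empty list of logits")
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        total += log_softmax(logits)[tok]
    return total / len(token_ids)


# Same distribution, different sampled token (assumed toy data):
likely = average_log_prob([[2.0, 0.0]], [0])
unlikely = average_log_prob([[2.0, 0.0]], [1])
```

Here `likely` exceeds `unlikely` even though the underlying distribution is identical, which is one reason an ablation could in principle separate the two signals.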
Circularity Check
No significant circularity detected in Intuitor's RLIF derivation
The paper defines Intuitor explicitly as a substitution of GRPO's external verifiable rewards with the model's own self-certainty scores, then reports empirical results on math benchmarks and out-of-domain code generation. This substitution is a stated design choice rather than a self-referential loop in which the claimed outcome (improved reasoning) is mathematically forced by the input definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or method description. Performance claims rest on external benchmark comparisons, which function as independent evaluation rather than internal tautologies. The framework is therefore self-contained against external measures.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Model-generated self-certainty scores provide a usable proxy for reasoning correctness.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We propose INTUITOR, an RLIF method that uses a model’s own confidence, termed self-certainty, as its sole reward signal... Self-certainty(o|q) := (1/|o|) ∑_{i=1}^{|o|} KL(U ∥ p_{π_θ}(· | q, o_{<i}))
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Experiments demonstrate that INTUITOR matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases.
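The self-certainty definition quoted in the first entry above can be expanded by writing out the KL divergence from the uniform distribution. The vocabulary symbol \mathcal{V} and the token index j are notation introduced here for the expansion and do not appear in the quoted passage.

```latex
\begin{align}
\text{Self-certainty}(o \mid q)
  &:= \frac{1}{|o|} \sum_{i=1}^{|o|}
      \mathrm{KL}\!\left(U \,\middle\|\, p_{\pi_\theta}(\cdot \mid q, o_{<i})\right) \\
  &= \frac{1}{|o|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|}
      \frac{1}{|\mathcal{V}|}
      \log \frac{1/|\mathcal{V}|}{p_{\pi_\theta}(j \mid q, o_{<i})} \\
  &= -\frac{1}{|o|\,|\mathcal{V}|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|}
      \log\!\left(|\mathcal{V}| \cdot p_{\pi_\theta}(j \mid q, o_{<i})\right)
\end{align}
```

The last line makes explicit that the score is zero when every next-token distribution is uniform and increases as probability mass concentrates on fewer tokens.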
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting
TTRL-Guard mitigates the Correct-Answer Extinction Window in test-time RL via flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates, yielding best average pass@1 on Qwen mod...
-
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
-
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
-
Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.
-
Sparse Reward Subsystem in Large Language Models
LLM hidden states contain a sparse reward subsystem consisting of value neurons that predict state value and dopamine neurons that encode step-level temporal difference errors.
-
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
-
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
-
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
The paper shows that advantage collapse in GRPO causes training stagnation on math reasoning benchmarks and proposes AVSPO, which uses real-time monitoring to inject virtual reward samples and reduces collapse while i...
-
Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
ABPO combines group-relative policy optimization with anchored exposure correction and asymmetric feedback handling to enable effective continual updates for LLM recommenders under bandit feedback constraints.
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
Your Model Diversity, Not Method, Determines Reasoning Strategy
The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Self-Aligned Reward: Towards Effective and Efficient Reasoners
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
-
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.