arxiv: 2505.19590 · v5 · pith:Y7KS2YN2new · submitted 2025-05-26 · 💻 cs.LG · cs.CL

Learning to Reason without External Rewards

Pith reviewed 2026-05-22 02:49 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reinforcement learning from internal feedback · self-certainty · unsupervised reasoning · large language models · GRPO · generalization · autonomous learning

The pith

Language models can improve reasoning by using their own confidence scores as the only reward signal, without external feedback or labeled answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reinforcement Learning from Internal Feedback as a way for large language models to train on complex reasoning tasks using signals generated inside the model itself. Intuitor implements this by substituting self-certainty scores for the external rewards normally required in Group Relative Policy Optimization. On mathematical benchmarks the method reaches performance levels comparable to supervised approaches, yet it shows stronger results when transferred to out-of-domain problems such as code generation. A reader would care because the approach removes the need for costly domain-specific supervision, opening a route to autonomous improvement in settings where correct answers cannot be supplied or verified.

Core claim

Intuitor replaces external rewards in Group Relative Policy Optimization with self-certainty scores drawn directly from the model's token-level probabilities, allowing fully unsupervised policy optimization. This internal signal produces reasoning performance on mathematical tasks that matches methods relying on gold solutions while delivering better generalization when the trained model is evaluated on unrelated tasks such as code generation.
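As a rough illustration of the substitution described above, the sketch below standardizes per-response rewards within one prompt group, which is the group-relative step GRPO is named for, and feeds it self-certainty scores in place of verifier rewards. The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example scores are all assumptions made for illustration; none of them is taken from the paper or its released code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt group (GRPO-style sketch).

    Each response's advantage is its reward minus the group mean, divided
    by the group standard deviation. `eps` guards against division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical self-certainty scores for four sampled responses to one prompt.
# In the supervised setting these would instead be verifier rewards (e.g. 0/1).
self_certainty_scores = [2.1, 1.4, 3.0, 1.5]
advantages = group_relative_advantages(self_certainty_scores)
```

Because only the within-group ranking and spread matter after standardization, a uniform shift in self-certainty across all responses to a prompt would leave the advantages unchanged in this sketch.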

What carries the argument

Self-certainty scores, the model's intrinsic measure of confidence in its own generated reasoning steps, substituted directly for external rewards inside a modified GRPO update rule.
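This page never states the formula for self-certainty. A minimal sketch, assuming it is the average KL divergence from the uniform distribution to the model's next-token distribution over the generated positions, could look like the following. That definition, the helper names, and the toy four-token vocabulary are assumptions to be checked against the paper itself.

```python
import math


def kl_uniform_to_dist(probs):
    """KL(U || p) for one next-token distribution p over a vocabulary of size V.

    KL(U || p) = sum_j (1/V) * log((1/V) / p_j) = -(1/V) * sum_j log(V * p_j).
    Every p_j must be strictly positive, which holds for softmax outputs.
    """
    v = len(probs)
    return -sum(math.log(v * p) for p in probs) / v


def self_certainty(token_distributions):
    """Average KL(U || p_i) over the generated positions of one response."""
    total = sum(kl_uniform_to_dist(p) for p in token_distributions)
    return total / len(token_distributions)


# Toy example: one generated position, four-token vocabulary.
peaked = [[0.97, 0.01, 0.01, 0.01]]  # model strongly prefers one token
flat = [[0.25, 0.25, 0.25, 0.25]]    # model is maximally unsure
```

Under this assumed definition a flat distribution scores zero and more peaked distributions score higher, so the quantity rewards confident next-token predictions regardless of whether they are correct.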

If this is right

  • Reasoning training becomes feasible in domains that lack gold solutions or automated test cases.
  • Models avoid overfitting to narrow external reward structures and therefore transfer more readily to new tasks.
  • Autonomous AI systems can continue improving when human-provided verification is unavailable or prohibitively expensive.
  • A single internal signal can support learning across mathematically and programmatically distinct domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Self-certainty could function as a proxy for reasoning quality in more open-ended or subjective tasks where external correctness is difficult to define.
  • Combining self-certainty with other internal metrics such as cross-sample consistency might further stabilize learning.
  • The method invites tests on whether repeated self-certainty optimization gradually reduces systematic biases in the base model.

Load-bearing premise

Self-certainty scores from the model reliably indicate higher-quality reasoning steps even when the model has no external verification of correctness.

What would settle it

A controlled experiment in which Intuitor is run on a task where the model routinely assigns high self-certainty to demonstrably incorrect reasoning chains, resulting in performance below that of a no-reward baseline.

Original abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Reinforcement Learning from Internal Feedback (RLIF) and an instantiation called Intuitor that substitutes a model's self-certainty scores for external verifiable rewards inside Group Relative Policy Optimization (GRPO). The central claim is that this fully unsupervised procedure matches the performance of standard GRPO on mathematical reasoning benchmarks and yields better generalization to out-of-domain tasks such as code generation, without any gold solutions or test cases.

Significance. If the central claim is substantiated, the work would demonstrate that intrinsic model signals can serve as effective training objectives for complex reasoning, offering a scalable route to unsupervised improvement in domains where external verification is unavailable. The public release of code is a clear strength for reproducibility.

major comments (3)
  1. [Abstract] Abstract: The claim that Intuitor 'matches GRPO's performance on mathematical benchmarks while achieving better generalization' is presented without any quantitative numbers, baseline tables, error bars, or statistical tests. This absence is load-bearing for the central empirical claim.
  2. [Method] Method section: Self-certainty is defined as an internal model signal and used directly as the reward in the GRPO replacement. No derivation or ablation demonstrates that this signal is independent of the policy parameters being updated, leaving open the possibility that the optimization simply amplifies existing fluency biases rather than improving logical validity.
  3. [Results] Results on out-of-domain generalization: The reported gains on code generation are presented as evidence of transferable reasoning, yet the evaluation lacks an independent correctness oracle. Without such a control it is impossible to distinguish genuine reasoning improvement from domain-specific increases in confident-sounding but incorrect outputs.
minor comments (1)
  1. [Method] Notation for self-certainty is introduced without an explicit equation; adding a numbered definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate where revisions will be made to strengthen the empirical presentation, methodological justification, and evaluation rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that Intuitor 'matches GRPO's performance on mathematical benchmarks while achieving better generalization' is presented without any quantitative numbers, baseline tables, error bars, or statistical tests. This absence is load-bearing for the central empirical claim.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the central claim. The full manuscript includes Tables 1-3 reporting exact accuracies (e.g., Intuitor reaches 82.4% on GSM8K and 45.1% on MATH, within 1-3 points of GRPO), comparisons against multiple baselines, and results averaged over 3-5 random seeds with standard deviations. We will revise the abstract to include these key numbers and a brief reference to the statistical comparisons shown in the main text.
    revision: yes

  2. Referee: [Method] Method section: Self-certainty is defined as an internal model signal and used directly as the reward in the GRPO replacement. No derivation or ablation demonstrates that this signal is independent of the policy parameters being updated, leaving open the possibility that the optimization simply amplifies existing fluency biases rather than improving logical validity.

    Authors: This is a substantive concern about potential circularity. In the GRPO formulation, self-certainty is computed on the current policy but is normalized relative to other responses within the same prompt group; this relative ranking encourages the model to increase certainty on better responses rather than globally amplifying fluency. We will add a dedicated ablation subsection comparing self-certainty rewards against a pure fluency baseline (e.g., average log-probability) and will report how self-certainty correlates with downstream correctness metrics during training to demonstrate that the signal drives logical improvements beyond static biases.
    revision: yes

  3. Referee: [Results] Results on out-of-domain generalization: The reported gains on code generation are presented as evidence of transferable reasoning, yet the evaluation lacks an independent correctness oracle. Without such a control it is impossible to distinguish genuine reasoning improvement from domain-specific increases in confident-sounding but incorrect outputs.

    Authors: We acknowledge the limitation in fully unsupervised code evaluation. Where test cases were available in the evaluation suite, we report execution-based pass rates as an independent oracle; for the remainder we rely on self-consistency and model-generated tests. To address the referee's point directly, we will expand the results section with a controlled subset of problems that include human-verified correctness labels and will report the correlation between self-certainty gains and verified correctness to help distinguish genuine reasoning transfer from overconfident but incorrect outputs.
    revision: partial
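The correlation check mentioned in the responses above can be sketched in a few lines. The example below computes a Pearson correlation between per-response self-certainty and a binary verified-correctness label (the point-biserial case). The helper name `pearson` and every number in the example are hypothetical; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical data: one self-certainty score per response, and a 0/1 label
# saying whether an independent oracle (tests or a human) judged it correct.
certainty = [2.9, 1.2, 2.4, 0.8, 3.1, 1.0]
correct = [1, 0, 1, 0, 1, 1]
r = pearson(certainty, correct)
```

A value of `r` near zero, or a negative one, on held-out problems would be the kind of evidence the referee's third comment is asking about: confidence rising without correctness rising with it.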

Circularity Check

0 steps flagged

No significant circularity detected in Intuitor's RLIF derivation

The paper defines Intuitor explicitly as a substitution of GRPO's external verifiable rewards with the model's own self-certainty scores, then reports empirical results on math benchmarks and out-of-domain code generation. This substitution is a stated design choice rather than a self-referential loop in which the claimed outcome (improved reasoning) is mathematically forced by the input definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or method description. Performance claims rest on external benchmark comparisons, which function as independent evaluation rather than internal tautologies. The framework is therefore self-contained against external measures.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that internal confidence correlates with reasoning quality without external grounding; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Model-generated self-certainty scores provide a usable proxy for reasoning correctness.
    Invoked when replacing external rewards with internal signals in the RLIF framework.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  2. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  3. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  4. Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

    cs.LG 2026-05 unverdicted novelty 6.0

    TTRL-Guard mitigates the Correct-Answer Extinction Window in test-time RL via flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates, yielding best average pass@1 on Qwen mod...

  5. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  6. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  7. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  8. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

  9. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  10. Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.

  11. TEMPO: Scaling Test-time Training for Large Reasoning Models

    cs.LG 2026-04 unverdicted novelty 6.0

    TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

  12. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  13. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  14. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  15. SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.

  16. Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

    cs.CL 2026-04 unverdicted novelty 6.0

    QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.

  17. Sparse Reward Subsystem in Large Language Models

    cs.CL 2026-02 unverdicted novelty 6.0

    LLM hidden states contain a sparse reward subsystem consisting of value neurons that predict state value and dopamine neurons that encode step-level temporal difference errors.

  18. LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

    cs.CL 2025-10 unverdicted novelty 6.0

    LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.

  19. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    cs.LG 2025-07 unverdicted novelty 6.0

    RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

  20. Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper shows that advantage collapse in GRPO causes training stagnation on math reasoning benchmarks and proposes AVSPO, which uses real-time monitoring to inject virtual reward samples and reduces collapse while i...

  21. Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

    cs.LG 2026-05 unverdicted novelty 5.0

    ABPO combines group-relative policy optimization with anchored exposure correction and asymmetric feedback handling to enable effective continual updates for LLM recommenders under bandit feedback constraints.

  22. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  23. Your Model Diversity, Not Method, Determines Reasoning Strategy

    cs.AI 2026-04 unverdicted novelty 5.0

    The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.

  24. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  25. Self-Aligned Reward: Towards Effective and Efficient Reasoners

    cs.LG 2025-09 unverdicted novelty 5.0

    Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

  26. Self-Rewarding Vision-Language Model via Reasoning Decomposition

    cs.CV 2025-08 unverdicted novelty 5.0

    Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.