Title resolution pending

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? · 2025

3 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

CAL-GRPO calibrates per-attempt weights in multi-attempt CoT to deliver unbiased gradients for optimizing Verification@K success while keeping variance low.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

citing papers explorer

Showing 3 of 3 citing papers.

Interpreting Reinforcement Learning Agents with Susceptibilities · cs.LG · 2026-05-08 · unverdicted · none · ref 102
Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought · cs.LG · 2026-04-20 · unverdicted · none · ref 20
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation · cs.CL · 2026-05-12 · unverdicted · none · ref 37 · 2 links
