A self-supervised reinforcement learning approach for fine-tuning large language models using cross-attention signals
3 Pith papers cite this work.
citing papers explorer
- When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study
Empirical evaluation on LiveCodeBench shows certainty-based RLIF yields early gains followed by output shortening and reasoning collapse, providing no advantage for RLVR initialization on code tasks.
- Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
- A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.