GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Process Supervision of Confidence Margin for Calibrated LLM Reasoning
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
- Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards
ACOER applies adaptive correct-only efficiency rewards in GRPO to avoid reward collapse, yielding higher accuracy and over 60% fewer tokens on math reasoning benchmarks.