RLCR augments standard RL rewards for LM reasoning with Brier scores on verbalized confidence, producing models that are both more accurate and better calibrated on in-domain and out-of-domain tasks.
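As a minimal sketch of how a Brier-score term on verbalized confidence could be combined with a binary correctness reward: the function names and the exact combination (correctness minus the squared error of the stated confidence) are assumptions for illustration, not details confirmed by the summary above.

```python
def brier_penalty(confidence: float, correct: bool) -> float:
    """Squared error between the stated confidence and the 0/1 outcome."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


def calibrated_reward(confidence: float, correct: bool) -> float:
    """Hypothetical reward: binary correctness minus the Brier penalty.

    Assumes `confidence` is the model's verbalized probability in [0, 1]
    that its own answer is correct.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correctness = 1.0 if correct else 0.0
    return correctness - brier_penalty(confidence, correct)


if __name__ == "__main__":
    # Compare confident, hedged, and doubtful answers in both outcomes.
    for conf, ok in [(1.0, True), (0.5, True), (0.5, False), (1.0, False)]:
        print(f"confidence={conf:.1f} correct={ok} reward={calibrated_reward(conf, ok):+.2f}")
```

Under this assumed form, a confident wrong answer scores lower than a hedged wrong answer, so the reward favors confidence that tracks correctness rather than uniformly high confidence.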
We evaluate using math-verify, a mathematical expression evaluation system released by Hugging Face.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)

Representative citing papers
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty