Recognition: no theorem link
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
Pith reviewed 2026-05-15 13:45 UTC · model grok-4.3
The pith
A gradient conflict between accuracy and calibration in RLVR is resolved by decoupling the objectives in DCPO.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a fundamental gradient conflict exists between maximizing policy accuracy and minimizing calibration error under RLVR, and that systematically decoupling the reasoning and calibration objectives in the DCPO framework preserves accuracy comparable to GRPO while achieving the best calibration performance and substantially mitigating over-confidence.
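The page does not reproduce the paper's derivation, but the shape of the claimed conflict can be sketched. The following is only an illustrative formalization under assumed notation (confidence read off the policy's own probability, a Brier-style calibration penalty), not the paper's actual equations.

```latex
% Assumed notation, not taken from the paper: x is a prompt, y ~ \pi_\theta(\cdot \mid x)
% a sampled answer, r(y) \in \{0,1\} the verifiable reward, A(y) its group-normalized
% advantage, and c_\theta(x, y) the model's confidence in the answer.
\nabla_\theta \mathcal{L}_{\mathrm{acc}}
  = -\,\mathbb{E}_{y \sim \pi_\theta}\!\big[\, A(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big],
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{cal}}
  = \mathbb{E}_{y \sim \pi_\theta}\!\big[\, \big(c_\theta(x, y) - r(y)\big)\, \nabla_\theta c_\theta(x, y) \,\big].
% If c_\theta is tied to \pi_\theta(y \mid x) itself, the accuracy term keeps sharpening the
% policy toward whichever sampled answers were rewarded, driving implied confidence toward 1
% on questions the model only sometimes solves, while the calibration term pulls the same
% parameters back toward the empirical success rate: the two gradients act in opposition.
```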
What carries the argument
DCPO, the framework that decouples reasoning optimization from calibration optimization to eliminate gradient conflicts.
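The page does not spell out DCPO's losses, so the sketch below only illustrates what such a decoupling can look like in code, with hypothetical names and interfaces throughout: a GRPO-style policy-gradient loss for reasoning, and a separate confidence readout fitted to verifier outcomes, so that neither gradient is added into the other.

```python
# Hypothetical sketch of decoupled objectives; names and interfaces are illustrative,
# not DCPO's actual implementation.
import torch
import torch.nn.functional as F

def decoupled_losses(logprobs, advantages, confidence, correct):
    """
    logprobs:   (batch,) summed log-probabilities of sampled responses
    advantages: (batch,) group-normalized advantages from the verifiable reward
    confidence: (batch,) model-reported confidence in [0, 1]
    correct:    (batch,) 1.0 if the verifier accepted the answer, else 0.0
    """
    # Reasoning objective: plain policy gradient on the verifiable reward,
    # with no calibration term mixed into it.
    reasoning_loss = -(advantages.detach() * logprobs).mean()
    # Calibration objective: fit the confidence readout to observed correctness,
    # kept as its own loss rather than a weighted addition to the one above.
    calibration_loss = F.binary_cross_entropy(confidence, correct)
    return reasoning_loss, calibration_loss

# Toy usage with random tensors standing in for one sampled group of 8 responses.
logprobs = torch.randn(8, requires_grad=True)
advantages = torch.randn(8)
confidence = torch.rand(8, requires_grad=True)
correct = (torch.rand(8) > 0.5).float()

r_loss, c_loss = decoupled_losses(logprobs, advantages, confidence, correct)
r_loss.backward()  # touches only the reasoning pathway in this toy setup
c_loss.backward()  # touches only the confidence pathway
```

In a full training loop the two losses would typically be stepped through separate optimizers or parameter groups; how DCPO actually routes them is not specified on this page.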
If this is right
- Models trained under DCPO maintain reasoning accuracy while producing confidence scores that more closely match actual correctness.
- The over-confidence problem on incorrect answers is substantially reduced compared with standard RLVR methods.
- Calibration performance reaches the best reported levels without requiring direct addition of calibration terms to the accuracy objective.
- The separation provides a practical route to more reliable LLM outputs on tasks with verifiable rewards.
Where Pith is reading between the lines
- Similar objective decoupling could address other gradient conflicts that arise when RLVR is combined with safety or efficiency constraints.
- The technique may generalize to non-LLM settings where reward maximization and uncertainty estimation compete.
- Testing DCPO on larger models and varied verifiable reward distributions would clarify the scope of the decoupling benefit.
Load-bearing premise
The decoupling step can be implemented without introducing new optimization instabilities or unintended effects on other model behaviors.
What would settle it
A controlled experiment in which DCPO either drops reasoning accuracy below GRPO levels or shows no improvement in calibration error metrics on a standard verifiable-reward benchmark would falsify the central claim.
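"Calibration error metrics" is left unspecified above; expected calibration error (ECE) is the usual choice, and a minimal reference computation, assuming each sampled answer carries a confidence and a binary verifier outcome, looks like this.

```python
# Minimal expected-calibration-error (ECE) computation; the binning scheme is one
# common choice, not necessarily the one used in the paper.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Include the left edge only for the first bin so each point lands in exactly one bin.
        mask = (confidences >= lo) & (confidences <= hi) if i == 0 else (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# Over-confidence shows up as high-confidence bins whose accuracy is much lower:
print(expected_calibration_error([0.95, 0.92, 0.90, 0.60], [0, 0, 1, 1]))
```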
read the original abstract
Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies devote to directly incorporating calibration objective into existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and practical solution for more reliable LLM deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RLVR improves LLM reasoning performance but causes severe calibration degeneration, leading to over-confidence on incorrect answers. It identifies a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error through theoretical analysis, and proposes DCPO as a decoupling framework that separates reasoning and calibration objectives. Experiments show DCPO achieves accuracy on par with GRPO while delivering the best calibration metrics and substantially reducing over-confidence.
Significance. If the gradient conflict derivation and empirical results hold, this provides a practical and insightful solution for reliable LLM deployment in reasoning tasks. The decoupling strategy addresses a key tension in RLVR optimization and could inform future multi-objective RL methods for LLMs, with the preserved accuracy alongside improved calibration being a notable strength.
major comments (1)
- [Theoretical Analysis] The central claim rests on the theoretical demonstration of a gradient conflict; without explicit equations or proof sketches in the theoretical section showing why standard joint optimization fails, it is difficult to assess whether the conflict is fundamental or resolvable by reweighting.
minor comments (3)
- [Method] Clarify the exact implementation of the decoupling in DCPO, including the modified loss terms and any additional hyperparameters introduced.
- [Experiments] Include ablation studies on the impact of the decoupling on other model behaviors beyond accuracy and calibration, such as response length or diversity.
- [Experiments] Ensure the experimental setup details (e.g., datasets, model sizes, training steps) are fully specified to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. The feedback on the theoretical analysis is constructive, and we address it directly below.
read point-by-point responses
- Referee: [Theoretical Analysis] The central claim rests on the theoretical demonstration of a gradient conflict; without explicit equations or proof sketches in the theoretical section showing why standard joint optimization fails, it is difficult to assess whether the conflict is fundamental or resolvable by reweighting.
Authors: We appreciate this observation. Section 3.2 of the manuscript derives the gradient conflict by contrasting the policy gradient term for accuracy maximization (which increases probability mass on correct tokens) against the calibration penalty term (which reduces overconfidence on incorrect answers). The analysis shows that these gradients oppose each other under the verifiable-reward setting, leading to a fundamental tension rather than a simple weighting issue. To strengthen clarity, we will revise the section to include the explicit gradient expressions for both objectives and a short proof sketch demonstrating that no fixed reweighting can eliminate the directional conflict. This addition will make the argument self-contained while preserving the original conclusions. revision: yes
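The referenced Section 3.2 is not reproduced here, so the reweighting point can only be restated in the obvious form under assumed notation; the paper's own proof sketch may differ.

```latex
% A joint objective with a fixed weight \lambda:
\mathcal{L}_{\mathrm{joint}}(\theta) = \mathcal{L}_{\mathrm{acc}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{cal}}(\theta),
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{joint}} = \nabla_\theta \mathcal{L}_{\mathrm{acc}} + \lambda\, \nabla_\theta \mathcal{L}_{\mathrm{cal}}.
% Wherever \langle \nabla_\theta \mathcal{L}_{\mathrm{acc}}, \nabla_\theta \mathcal{L}_{\mathrm{cal}} \rangle < 0,
% any fixed \lambda > 0 only rescales one side of the opposition: the update still moves
% against one of the two objectives, which is the motivation the rebuttal gives for
% decoupling rather than reweighting.
```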
Circularity Check
No significant circularity detected
full rationale
The paper's derivation begins with a theoretical demonstration of gradient conflict between accuracy maximization and calibration minimization, which is presented as an independent analysis rather than a redefinition of terms or a fit to its own outputs. DCPO is then introduced as a decoupling framework motivated by this conflict, with empirical validation against external baselines such as GRPO showing preserved accuracy and improved calibration metrics. No load-bearing step reduces by construction to self-citation chains, ansatz smuggling, or renaming of known results; the central claims remain self-contained against external benchmarks and falsifiable comparisons.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- Process Supervision of Confidence Margin for Calibrated LLM Reasoning: RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
Reference graph
Works this paper leans on
- [1] Bereket, M. and Leskovec, J. Uncalibrated reasoning: GRPO induces overconfidence for stochastic outcomes. arXiv preprint arXiv:2508.11800.
- [2] Chhikara, P. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models. arXiv preprint arXiv:2502.11028.
- [3] Damani, M., Puri, I., Slocum, S., Shenfeld, I., Choshen, L., Kim, Y., and Andreas, J. Beyond binary rewards: Training LMs to reason about their uncertainty. arXiv preprint arXiv:2507.16806.
- [4] Ding, H., Pang, L., Wei, Z., Shen, H., and Cheng, X. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. arXiv preprint arXiv:2402.10612.
- [5] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- [6] Heo, J., Xiong, M., Heinze-Deml, C., and Narain, J. Do LLMs estimate uncertainty well in instruction-following? arXiv preprint arXiv:2410.14582.
- [7] Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290.
- [8] Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
- [9] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- [10] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [11]
- [12] Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
- [13] Leng, J., Huang, C., Zhu, B., and Huang, J. Taming overconfidence in LLMs: Reward calibration in RLHF. arXiv preprint arXiv:2410.09724.
- [14] Li, P., Skripkin, M., Zubrey, A., Kuznetsov, A., and Oseledets, I. Confidence is all you need: Few-shot RL fine-tuning of language models. arXiv preprint arXiv:2506.06395, 2025a. Li, R., Long, J., Qi, M., Xia, H., Sha, L., Wang, P., and Sui, Z. Towards harmonized uncertainty estimation for large language models. arXiv preprint arXiv:2505.19073, 2025b. Lin, ...
- [15] Lin, Z., Trivedi, S., and Sun, J. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
- [16] Liu, H., Wang, S., and Xu, H. C2GSPG: Confidence-calibrated group sequence policy gradient towards self-aware reasoning. arXiv preprint arXiv:2509.23129.
- [17] Ni, S., Bi, K., Guo, J., Yu, L., Bi, B., and Cheng, X. Towards fully exploiting LLM internal states to enhance knowledge boundary perception. arXiv preprint arXiv:2502.11677.
- [18] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [19] Wang, Z., Wang, Q., Zhang, Y., Chen, T., Zhu, X., Shi, X., and Xu, K. SConU: Selective conformal uncertainty in large language models. arXiv preprint arXiv:2504.14154.
- [20] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.
- [21] Xu, T., Wu, S., Diao, S., Liu, X., Wang, X., Chen, Y., and Gao, J. SaySelf: Teaching LLMs to express confidence with self-reflective rationales. arXiv preprint arXiv:2405.20974.
- [22] Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024a. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv...
- [23] Yang, D., Tsai, Y.-H. H., and Yamada, M. On verbalized confidence scores for LLMs. arXiv preprint arXiv:2412.14737, 2024b. Yao, Z., Liu, Y., Chen, Y., Chen, J., Fang, J., Hou, L., Li, J., and Chua, T.-S. Are reasoning models more prone to hallucination? arXiv preprint arXiv:2505.23646.
- [24] Yoon, D., Kim, S., Yang, S., Kim, S., Kim, S., Kim, Y., Choi, E., Kim, Y., and Seo, M. Reasoning models better express their confidence. arXiv preprint arXiv:2505.14489.
- [25] Zhang, Y. and Math-AI, T. American Invitational Mathematics Examination (AIME) 2024.
- [26] Zhang, Y. and Math-AI, T. American Invitational Mathematics Examination (AIME) 2025.
- [27] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- [28] Zhou, Z., Lu, X., Cao, C., Miranda, B., Liu, T., Han, B., and Koyejo, S. Codapo: Confidence and difficulty-adaptive policy optimization for post-training language models. In 2nd AI for Math Workshop @ ICML 2025.
- [29]
discussion (0)