pith. machine review for the scientific record.

arxiv: 2603.09117 · v2 · submitted 2026-03-10 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning from verifiable rewards · LLM calibration · gradient conflict · over-confidence · decoupling objectives · RLVR · DCPO · policy optimization

The pith

A gradient conflict between accuracy and calibration in RLVR is resolved by decoupling the objectives in DCPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning from verifiable rewards boosts LLM reasoning performance yet triggers severe over-confidence, because the gradient signals for improving answer accuracy directly oppose those for improving calibration. Previous attempts that add calibration terms to the existing objective fail because of this inherent tension. DCPO instead separates reasoning optimization from calibration optimization, allowing each to proceed without interference. This separation keeps reasoning accuracy on par with standard GRPO while delivering the strongest calibration results and sharply reducing over-confident errors on wrong answers.
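
Read as two objectives pulling on the same parameters, the tension can be written compactly. The notation below is a generic sketch of ours, not the paper's own formulation, with a Brier-style squared error standing in for whatever calibration objective the authors actually use.

    % Generic notation (ours, not the paper's): \pi_\theta is the policy, x a prompt,
    % y a sampled response, r(x,y) \in \{0,1\} the verifiable reward, and
    % c_\theta(x,y) \in [0,1] the model's verbalized confidence in its answer.
    \begin{align}
      J_{\mathrm{acc}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\bigl[\, r(x,y) \,\bigr]
        && \text{maximize: accuracy under the verifiable reward} \\
      \mathcal{L}_{\mathrm{cal}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\bigl[\, \bigl(c_\theta(x,y) - r(x,y)\bigr)^2 \,\bigr]
        && \text{minimize: a Brier-style calibration error}
    \end{align}
    % A coupled objective J_acc - beta * L_cal asks a single gradient step to serve both
    % terms at once; the paper's claim is that their gradients conflict, which is why DCPO
    % routes them through separate signals instead.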

Core claim

The central claim is that a fundamental gradient conflict exists between maximizing policy accuracy and minimizing calibration error under RLVR, and that systematically decoupling the reasoning and calibration objectives in the DCPO framework preserves accuracy comparable to GRPO while achieving the best calibration performance and substantially mitigating over-confidence.

What carries the argument

DCPO, the framework that decouples reasoning optimization from calibration optimization to eliminate gradient conflicts.
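
The framework description quoted in Figure 2 (block-wise verbalized confidence rollout, decoupled advantage estimation, hybrid instance-level and group-level signals mixed by a coefficient λ) suggests roughly the following shape for the advantage computation. This is a minimal sketch under our own assumptions: the function name, the exact calibration signals, and the group statistics are ours, not the paper's implementation.

    import numpy as np

    def decoupled_advantages(rewards, confidences, lam=0.5, eps=1e-6):
        # Sketch of decoupled advantage estimation for one prompt's rollout group.
        #   rewards     : (G,) verifiable rewards, 1.0 if the final answer is correct, else 0.0
        #   confidences : (G,) verbalized confidences in [0, 1], read from each rollout's confidence block
        #   lam         : mixes group-level (lam = 1) and instance-level (lam = 0) calibration signals
        rewards = np.asarray(rewards, dtype=float)
        confidences = np.asarray(confidences, dtype=float)

        # Reasoning advantage: plain GRPO-style group normalization of the verifiable reward,
        # untouched by any confidence term.
        adv_reason = (rewards - rewards.mean()) / (rewards.std() + eps)

        # Instance-level signal: confidence should track this rollout's own correctness.
        sig_instance = -np.abs(confidences - rewards)
        # Group-level signal: confidence should track the group's empirical accuracy.
        sig_group = -np.abs(confidences - rewards.mean())

        # Hybrid calibration signal, normalized within the group; it would be applied only to
        # confidence-block tokens, never to reasoning tokens.
        cal = lam * sig_group + (1.0 - lam) * sig_instance
        adv_conf = (cal - cal.mean()) / (cal.std() + eps)
        return adv_reason, adv_conf

    # Toy usage: four rollouts, two correct; the over-confident wrong rollout receives a
    # negative confidence advantage without its reasoning advantage being altered.
    a_reason, a_conf = decoupled_advantages(rewards=[1, 0, 1, 0], confidences=[0.9, 0.95, 0.6, 0.2])
    print(a_reason, a_conf)

In a full trainer, adv_reason would presumably weight only the reasoning and answer tokens and adv_conf only the confidence-block tokens, which is the mechanical sense in which the two objectives stop sharing a gradient signal; per the paper's λ ablation, 0.5 strikes a favorable balance between the group-level (λ = 1) and instance-level (λ = 0) variants.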

If this is right

  • Models trained under DCPO maintain reasoning accuracy while producing confidence scores that more closely match actual correctness.
  • The over-confidence problem on incorrect answers is substantially reduced compared with standard RLVR methods.
  • Calibration performance reaches the best reported levels without requiring direct addition of calibration terms to the accuracy objective.
  • The separation provides a practical route to more reliable LLM outputs on tasks with verifiable rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar objective decoupling could address other gradient conflicts that arise when RLVR is combined with safety or efficiency constraints.
  • The technique may generalize to non-LLM settings where reward maximization and uncertainty estimation compete.
  • Testing DCPO on larger models and varied verifiable reward distributions would clarify the scope of the decoupling benefit.

Load-bearing premise

The decoupling step can be implemented without introducing new optimization instabilities or unintended effects on other model behaviors.

What would settle it

A controlled experiment in which DCPO either drops reasoning accuracy below GRPO levels or shows no improvement in calibration error metrics on a standard verifiable-reward benchmark would falsify the central claim.
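
Concretely, such a comparison turns on a small set of metrics. Below is a minimal sketch of two of them, written from standard definitions rather than the paper's evaluation code; the paper also reports PCE and AUROC, which are not reproduced here.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # Binned ECE: frequency-weighted gap between mean confidence and empirical accuracy
        # within each confidence bin (the quantity reported above each subplot in Figure 3).
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for i in range(n_bins):
            lo, hi = edges[i], edges[i + 1]
            last = i == n_bins - 1
            mask = (confidences >= lo) & ((confidences <= hi) if last else (confidences < hi))
            if mask.any():
                ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
        return ece

    def mean_confidence_on_errors(confidences, correct):
        # Average stated confidence on incorrect answers: the over-confidence failure mode DCPO targets.
        confidences = np.asarray(confidences, dtype=float)
        wrong = np.asarray(correct, dtype=float) == 0.0
        return float(confidences[wrong].mean()) if wrong.any() else float("nan")

    # The falsification test amounts to comparing (accuracy, ECE, confidence-on-errors) for
    # GRPO- and DCPO-trained checkpoints on the same verifiable-reward benchmark.
    conf = [0.95, 0.90, 0.80, 0.30, 0.99]
    hit = [1, 1, 0, 0, 1]
    print(expected_calibration_error(conf, hit), mean_confidence_on_errors(conf, hit))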

Figures

Figures reproduced from arXiv: 2603.09117 by Boxi Cao, Hongyu Lin, Jinglin Yang, Le Sun, Min He, Xianpei Han, Xueru Wen, Yaojie Lu, Zhengzhao Ma.

Figure 1
Figure 1. Illustration of gradient conflict between policy accuracy maximization and calibration error minimization. view at source ↗
Figure 2
Figure 2. The overall framework of DCPO, which leverages block-wise verbalized confidence rollout and decoupled advantage estimation to decouple the optimization objectives of accuracy and calibration, and further integrates instance-level and group-level signals for more stable calibration optimization. view at source ↗
Figure 3
Figure 3. Reliability diagrams for different LLMs. The dashed line denotes perfect calibration; bar height indicates empirical accuracy per confidence bin, and color intensity reflects sample frequency. The Expected Calibration Error (ECE) is reported above each subplot, revealing prevalent over-confidence across models. view at source ↗
Figure 5
Figure 5. The accuracy and calibration performance of QWEN3-8B trained with different RL methods. The figures illustrate that while existing calibration optimization methods can improve model calibration, their accuracy decreases. view at source ↗
Figure 6
Figure 6. Accuracy and PCE on the AIME25 dataset at different training steps for GRPO and DCPO. The figures illustrate that during the training process, our method can significantly reduce over-confidence while preserving accuracy. view at source ↗
Figure 7
Figure 7. The gradient-norm dynamics across different training methods, which demonstrates that DCPO achieves more stable optimization dynamics than other methods. view at source ↗
Figure 8
Figure 8. Distribution of verbalized confidence predictions across 5 mathematical benchmarks. The y-axis is log-scaled to better visualize the highly concentrated confidence distributions. view at source ↗
Figure 9
Figure 9. Generation length during training. view at source ↗
read the original abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies devote to directly incorporating calibration objective into existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and practical solution for more reliable LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that RLVR improves LLM reasoning performance but causes severe calibration degeneration, leading to over-confidence on incorrect answers. It identifies a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error through theoretical analysis, and proposes DCPO as a decoupling framework that separates reasoning and calibration objectives. Experiments show DCPO achieves accuracy on par with GRPO while delivering the best calibration metrics and substantially reducing over-confidence.

Significance. If the gradient conflict derivation and empirical results hold, this provides a practical and insightful solution for reliable LLM deployment in reasoning tasks. The decoupling strategy addresses a key tension in RLVR optimization and could inform future multi-objective RL methods for LLMs, with the preserved accuracy alongside improved calibration being a notable strength.

major comments (1)
  1. [Theoretical Analysis] The central claim rests on the theoretical demonstration of a gradient conflict; without explicit equations or proof sketches in the theoretical section showing why standard joint optimization fails, it is difficult to assess whether the conflict is fundamental or resolvable by reweighting.
minor comments (3)
  1. [Method] Clarify the exact implementation of the decoupling in DCPO, including the modified loss terms and any additional hyperparameters introduced.
  2. [Experiments] Include ablation studies on the impact of the decoupling on other model behaviors beyond accuracy and calibration, such as response length or diversity.
  3. [Experiments] Ensure the experimental setup details (e.g., datasets, model sizes, training steps) are fully specified to support reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The feedback on the theoretical analysis is constructive, and we address it directly below.

read point-by-point responses
  1. Referee: [Theoretical Analysis] The central claim rests on the theoretical demonstration of a gradient conflict; without explicit equations or proof sketches in the theoretical section showing why standard joint optimization fails, it is difficult to assess whether the conflict is fundamental or resolvable by reweighting.

    Authors: We appreciate this observation. Section 3.2 of the manuscript derives the gradient conflict by contrasting the policy gradient term for accuracy maximization (which increases probability mass on correct tokens) against the calibration penalty term (which reduces overconfidence on incorrect answers). The analysis shows that these gradients oppose each other under the verifiable-reward setting, leading to a fundamental tension rather than a simple weighting issue. To strengthen clarity, we will revise the section to include the explicit gradient expressions for both objectives and a short proof sketch demonstrating that no fixed reweighting can eliminate the directional conflict. This addition will make the argument self-contained while preserving the original conclusions. revision: yes
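
The gradient expressions promised here are not quoted in the text available on this page, so the following is only a generic reconstruction of what such opposing terms could look like, in the same notation as the sketch above; it should not be read as the paper's Section 3.2.

    % Reconstruction (ours): A(x,y) is a group-normalized advantage built from the verifiable
    % reward r(x,y), and c_\theta(x,y) the verbalized confidence produced by the same parameters.
    \begin{align}
      \nabla_\theta J_{\mathrm{acc}}
        &= \mathbb{E}_{y \sim \pi_\theta}\bigl[\, A(x,y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\bigr], \\
      -\nabla_\theta \mathcal{L}_{\mathrm{cal}}
        &\approx -\,2\, \mathbb{E}_{y \sim \pi_\theta}\bigl[\, \bigl(c_\theta(x,y) - r(x,y)\bigr)\, \nabla_\theta c_\theta(x,y) \,\bigr]
        \quad \text{(keeping only the pathwise term through } c_\theta\text{)}.
    \end{align}
    % When the same parameters produce both the answer and the stated confidence, the accuracy
    % term tends to drag confidence upward on whatever gets reinforced, while the calibration
    % term pulls confidence toward the empirical correctness rate; rescaling either term by a
    % fixed weight beta > 0 changes magnitudes, not directions, which is the rebuttal's point
    % that reweighting alone cannot dissolve the conflict.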

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation begins with a theoretical demonstration of gradient conflict between accuracy maximization and calibration minimization, which is presented as an independent analysis rather than a redefinition of terms or a fit to its own outputs. DCPO is then introduced as a decoupling framework motivated by this conflict, with empirical validation against external baselines such as GRPO showing preserved accuracy and improved calibration metrics. No load-bearing step reduces by construction to self-citation chains, ansatz smuggling, or renaming of known results; the central claims remain self-contained against external benchmarks and falsifiable comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5452 in / 907 out tokens · 40392 ms · 2026-05-15T13:45:40.413602+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Process Supervision of Confidence Margin for Calibrated LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Uncalibrated reasoning: GRPO induces overconfidence for stochastic outcomes

    Bereket, M. and Leskovec, J. Uncalibrated reasoning: GRPO induces overconfidence for stochastic outcomes. arXiv preprint arXiv:2508.11800.

  2. [2]

    Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models

    Chhikara, P. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models. arXiv preprint arXiv:2502.11028.

  3. [3]

    Beyond binary rewards: Training LMs to reason about their uncertainty

    Damani, M., Puri, I., Slocum, S., Shenfeld, I., Choshen, L., Kim, Y., and Andreas, J. Beyond binary rewards: Training LMs to reason about their uncertainty. arXiv preprint arXiv:2507.16806.

  4. [4]

    Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models

    Ding, H., Pang, L., Wei, Z., Shen, H., and Cheng, X. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. arXiv preprint arXiv:2402.10612.

  5. [5]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  6. [6]

    Do LLMs estimate uncertainty well in instruction-following?

    Heo, J., Xiong, M., Heinze-Deml, C., and Narain, J. Do LLMs estimate uncertainty well in instruction-following? arXiv preprint arXiv:2410.14582.

  7. [7]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290.

  8. [8]

    OpenAI o1 System Card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

  9. [9]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

  10. [10]

    Language Models (Mostly) Know What They Know

    Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

  11. [11]

    AbstentionBench: Reasoning LLMs fail on unanswerable questions

    Kirichenko, P., Ibrahim, M., Chaudhuri, K., and Bell, S. J. AbstentionBench: Reasoning LLMs fail on unanswerable questions. arXiv preprint arXiv:2506.09038.

  12. [12]

    Tulu 3: Pushing frontiers in open language model post-training

    Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

  13. [13]

    Taming overconfidence in LLMs: Reward calibration in RLHF

    Leng, J., Huang, C., Zhu, B., and Huang, J. Taming overconfidence in LLMs: Reward calibration in RLHF. arXiv preprint arXiv:2410.09724.

  14. [14]

    Confidence is all you need: Few-shot RL fine-tuning of language models

    Li, P., Skripkin, M., Zubrey, A., Kuznetsov, A., and Oseledets, I. Confidence is all you need: Few-shot RL fine-tuning of language models. arXiv preprint arXiv:2506.06395, 2025a. Li, R., Long, J., Qi, M., Xia, H., Sha, L., Wang, P., and Sui, Z. Towards harmonized uncertainty estimation for large language models. arXiv preprint arXiv:2505.19073, 2025b.

  15. [15]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Lin, Z., Trivedi, S., and Sun, J. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.

  16. [16]

    C2GSPG: Confidence-calibrated group sequence policy gradient towards self-aware reasoning

    Liu, H., Wang, S., and Xu, H. C2GSPG: Confidence-calibrated group sequence policy gradient towards self-aware reasoning. arXiv preprint arXiv:2509.23129.

  17. [17]

    Towards fully exploiting LLM internal states to enhance knowledge boundary perception

    Ni, S., Bi, K., Guo, J., Yu, L., Bi, B., and Cheng, X. Towards fully exploiting LLM internal states to enhance knowledge boundary perception. arXiv preprint arXiv:2502.11677.

  18. [18]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  19. [19]

    SConU: Selective conformal uncertainty in large language models

    Wang, Z., Wang, Q., Zhang, Y., Chen, T., Zhu, X., Shi, X., and Xu, K. SConU: Selective conformal uncertainty in large language models. arXiv preprint arXiv:2504.14154.

  20. [20]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

  21. [21]

    SaySelf: Teaching LLMs to express confidence with self-reflective rationales

    Xu, T., Wu, S., Diao, S., Liu, X., Wang, X., Chen, Y., and Gao, J. SaySelf: Teaching LLMs to express confidence with self-reflective rationales. arXiv preprint arXiv:2405.20974.

  22. [22]

    Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement

    Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024a. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv ...

  23. [23]

    On Verbalized Confidence Scores for LLMs

    Yang, D., Tsai, Y.-H. H., and Yamada, M. On verbalized confidence scores for LLMs. arXiv preprint arXiv:2412.14737, 2024b. Yao, Z., Liu, Y., Chen, Y., Chen, J., Fang, J., Hou, L., Li, J., and Chua, T.-S. Are reasoning models more prone to hallucination? arXiv preprint arXiv:2505.23646.

  24. [24]

    Reasoning models better express their confidence

    Yoon, D., Kim, S., Yang, S., Kim, S., Kim, S., Kim, Y., Choi, E., Kim, Y., and Seo, M. Reasoning models better express their confidence. arXiv preprint arXiv:2505.14489.

  25. [25]

    American Invitational Mathematics Examination (AIME) 2024

    Zhang, Y. and Math-AI, T. American Invitational Mathematics Examination (AIME) 2024.

  26. [26]

    American Invitational Mathematics Examination (AIME) 2025

    Zhang, Y. and Math-AI, T. American Invitational Mathematics Examination (AIME) 2025.

  27. [27]

    Group Sequence Policy Optimization

    Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

  28. [28]

    Codapo: Confidence and difficulty-adaptive policy optimization for post-training language models

    Zhou, Z., Lu, X., Cao, C., Miranda, B., Liu, T., Han, B., and Koyejo, S. Codapo: Confidence and difficulty-adaptive policy optimization for post-training language models. In 2nd AI for Math Workshop @ ICML 2025.
