Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Boxi Cao; Hongyu Lin; Jinglin Yang; Le Sun; Min He; Xianpei Han; Xueru Wen; Yaojie Lu; Zhengzhao Ma

arXiv: 2603.09117 · v3 · pith:B232T5UTnew · submitted 2026-03-10 · 💻 cs.LG · cs.AI · cs.CL

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

Pith reviewed 2026-05-15 13:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning from verifiable rewards · LLM calibration · gradient conflict · over-confidence · decoupling objectives · RLVR · DCPO · policy optimization

The pith

A gradient conflict between accuracy and calibration in RLVR is resolved by decoupling the objectives in DCPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning from verifiable rewards boosts LLM reasoning performance yet triggers severe over-confidence because the gradient signals for improving answer accuracy directly oppose those for improving calibration. Previous attempts to add calibration terms to the existing objective fail due to this inherent tension. By introducing DCPO, the method separates the reasoning optimization from the calibration optimization, allowing each to proceed without interference. This separation keeps reasoning accuracy on par with standard GRPO while delivering the strongest calibration results and sharply reducing over-confident errors on wrong answers.

Core claim

The central claim is that a fundamental gradient conflict exists between maximizing policy accuracy and minimizing calibration error under RLVR, and that systematically decoupling the reasoning and calibration objectives in the DCPO framework preserves accuracy comparable to GRPO while achieving the best calibration performance and substantially mitigating over-confidence.
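One generic way to probe whether two objectives conflict is to compare the directions of their gradients. The sketch below is a diagnostic under stated assumptions, not the paper's analysis: `gradient_cosine` is a hypothetical helper, and the two input vectors stand in for flattened gradients of an accuracy objective and a calibration objective, however those are obtained.

```python
import numpy as np

def gradient_cosine(g_acc, g_cal):
    """Cosine similarity between two flattened gradient vectors.

    A negative value means a step that helps one objective hurts the
    other to first order. Returns 0.0 if either gradient is all zeros.
    """
    g_acc = np.asarray(g_acc, dtype=float).ravel()
    g_cal = np.asarray(g_cal, dtype=float).ravel()
    denom = np.linalg.norm(g_acc) * np.linalg.norm(g_cal)
    return float(g_acc @ g_cal / denom) if denom > 0 else 0.0

# Toy vectors only: opposite directions give -1, orthogonal directions give 0.
opposed = gradient_cosine([1.0, 0.0], [-1.0, 0.0])
orthogonal = gradient_cosine([1.0, 0.0], [0.0, 1.0])
```

Tracking this value over training steps would show whether the conflict is persistent or only occasional.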

What carries the argument

DCPO, the framework that decouples reasoning optimization from calibration optimization to eliminate gradient conflicts.
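To make the decoupling idea concrete, here is a minimal sketch of separate advantage estimates for the reasoning block and the confidence block of each rollout, loosely following the Figure 2 caption. The function names, the squared-error calibration reward, and the way `lam` mixes group-level and instance-level terms are all assumptions for illustration. The page only states that λ = 0 gives the instance-level variant and λ = 1 the group-level variant, so the actual DCPO losses may differ.

```python
import numpy as np

def group_normalize(x, eps=1e-6):
    """GRPO-style advantage: center and scale values within one rollout group."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def decoupled_advantages(correct, confidence, lam=0.5):
    """Return (reasoning advantages, confidence advantages) for one prompt's group.

    correct:    1.0 if the verifier accepts the rollout's answer, else 0.0
    confidence: verbalized confidence in [0, 1] stated by each rollout
    lam:        assumed weight on the group-level calibration term
    """
    correct = np.asarray(correct, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    # Reasoning tokens are credited by the verifiable reward only.
    adv_reasoning = group_normalize(correct)
    # Confidence tokens are credited by a calibration reward only (assumed form).
    instance_err = (confidence - correct) ** 2        # per-rollout squared error
    group_err = (confidence - correct.mean()) ** 2    # gap to the group's accuracy
    calib_reward = -(lam * group_err + (1.0 - lam) * instance_err)
    adv_confidence = group_normalize(calib_reward)
    return adv_reasoning, adv_confidence

correct = [1, 0, 0, 1]
confidence = [0.9, 0.9, 0.2, 0.6]
adv_r, adv_c = decoupled_advantages(correct, confidence)
```

In this toy group the confidently wrong rollout (index 1) receives the lowest confidence advantage, while the reasoning advantages depend only on correctness and do not change if the stated confidences change.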

If this is right

Models trained under DCPO maintain reasoning accuracy while producing confidence scores that more closely match actual correctness.
The over-confidence problem on incorrect answers is substantially reduced compared with standard RLVR methods.
Calibration performance reaches the best reported levels without requiring direct addition of calibration terms to the accuracy objective.
The separation provides a practical route to more reliable LLM outputs on tasks with verifiable rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar objective decoupling could address other gradient conflicts that arise when RLVR is combined with safety or efficiency constraints.
The technique may generalize to non-LLM settings where reward maximization and uncertainty estimation compete.
Testing DCPO on larger models and varied verifiable reward distributions would clarify the scope of the decoupling benefit.

Load-bearing premise

The decoupling step can be implemented without introducing new optimization instabilities or unintended effects on other model behaviors.

What would settle it

A controlled experiment in which DCPO either drops reasoning accuracy below GRPO levels or shows no improvement in calibration error metrics on a standard verifiable-reward benchmark would falsify the central claim.
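For reference, the calibration side of such an experiment is commonly measured with Expected Calibration Error. The sketch below uses the standard equal-width binning definition; the function name and the choice of 10 bins are assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return ece

# Over-confident toy case: 95% stated confidence, 25% actual accuracy.
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 0])
# Matched toy case: 25% stated confidence, 25% actual accuracy.
matched = expected_calibration_error([0.25] * 4, [1, 0, 0, 0])
```

The over-confident toy case yields an ECE of 0.70 and the matched case yields 0, which is the direction of change the settling experiment would look for.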

Figures

Figures reproduced from arXiv: 2603.09117 by Boxi Cao, Hongyu Lin, Jinglin Yang, Le Sun, Min He, Xianpei Han, Xueru Wen, Yaojie Lu, Zhengzhao Ma.

**Figure 1.** Illustration of gradient conflict between policy accuracy maximization and calibration error minimization.

… advances in mathematical reasoning (Guo et al., 2025; Hu et al., 2025), code generation (Luo et al., 2025a), and question answering (Jaech et al., 2024; Hu et al., 2025) tasks. Despite the success, RLVR often leads to severe calibration degeneration, emerging as a critical bottleneck that limits the …

**Figure 2.** The overall framework of DCPO, which leverages block-wise verbalized confidence rollout and decoupled advantage estimation to decouple the optimization objectives of accuracy and calibration, and further integrates instance-level and group-level signals for more stable calibration optimization.

… the key factors that give rise to the “accuracy–calibration tradeoff”. Specifically, we reveal a critical gradien…

**Figure 3.** Reliability diagrams for different LLMs. The dashed line denotes perfect calibration; bar height indicates empirical accuracy per confidence bin, and color intensity reflects sample frequency. The Expected Calibration Error (ECE) is reported above each subplot, revealing prevalent over-confidence across models.

**Figure 5.** The accuracy and calibration performance of Qwen3-8B trained with different RL methods. The figures illustrate that while existing calibration optimization methods can improve model calibration, their accuracy decreases.

3.3. Accuracy-Calibration Tradeoff of Coupled Optimization

Recent calibration-aware reinforcement learning methods aim to jointly optimize reasoning accuracy and confidence calibration by …

**Figure 6.** Accuracy and PCE on the AIME25 dataset at different training steps for GRPO and DCPO. The figures illustrate that during the training process, our method can significantly reduce over-confidence while preserving accuracy.

… top of GRPO yields only marginal ECE improvements (e.g., from 0.370 to 0.363 on AIME24), while achieving a low AUROC of 0.642, substantially below DCPO’s 0.914, which indicates that RLVR sign…

**Figure 7.** The gradient-norm dynamics across different training methods, which demonstrate that DCPO achieves more stable optimization dynamics than other methods.

… reasoning behaviors. In contrast, DCPO preserves reasoning performance while achieving better calibration. Overall, these results demonstrate that decoupled optimization, hybrid group-instance supervision, and on-policy calibration are all critical comp…

**Figure 8.** Distribution of verbalized confidence predictions across 5 mathematical benchmarks. The y-axis is log-scaled to better visualize the highly concentrated confidence distributions.

… and continuous confidence distribution, which demonstrates that decoupled calibration with hybrid supervision is critical for learning expressive and reliable verbalized confidence.

7. Conclusion

In this paper, we analyze in theor…

**Figure 9.** Generation length during training.

C.3. Hyperparameter λ Sensitivity

DCPO introduces a hybrid coefficient λ between group-level and instance-level calibration objectives. In the main paper, we report comparisons among DCPO (λ = 0.5), DCPO-I (λ = 0), and DCPO-G (λ = 1.0), showing that λ = 0.5 achieves a favorable balance between accuracy and calibration.

Original abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers from severe calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies directly incorporate a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pins down a gradient conflict between accuracy and calibration in RLVR and shows DCPO can separate them to keep accuracy while fixing overconfidence.

The main point is that RLVR boosts reasoning accuracy but creates overconfidence because the gradients for accuracy and calibration pull in opposite directions. DCPO decouples the objectives to avoid that clash, and the experiments indicate it matches GRPO on accuracy while delivering the strongest calibration results and cutting overconfidence noticeably.

That gradient conflict argument is the clearest new piece here, and it explains why earlier attempts to just tack on a calibration term fell short. The paper positions this cleanly against existing RLVR work and backs the fix with direct comparisons. The results look consistent on the reported metrics, with no obvious instabilities or side effects showing up in the tests.

One soft spot is that full reproducibility would need the exact training details and derivations, which the abstract only summarizes. The decoupling itself seems straightforward, but it would be useful to know how sensitive it is to hyperparameter choices or different verifiable reward setups.

This is aimed at groups working on reliable LLM reasoning for deployment, especially where overconfidence is a practical blocker. Readers focused on calibration fixes in RL fine-tuning would get a usable method and a clear explanation of why it works. I would send it to peer review. The core claim is grounded enough and the evidence supports a full look.

Referee Report

1 major / 3 minor

Summary. The paper claims that RLVR improves LLM reasoning performance but causes severe calibration degeneration, leading to over-confidence on incorrect answers. It identifies a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error through theoretical analysis, and proposes DCPO as a decoupling framework that separates reasoning and calibration objectives. Experiments show DCPO achieves accuracy on par with GRPO while delivering the best calibration metrics and substantially reducing over-confidence.

Significance. If the gradient conflict derivation and empirical results hold, this provides a practical and insightful solution for reliable LLM deployment in reasoning tasks. The decoupling strategy addresses a key tension in RLVR optimization and could inform future multi-objective RL methods for LLMs, with the preserved accuracy alongside improved calibration being a notable strength.

major comments (1)

[Theoretical Analysis] The central claim rests on the theoretical demonstration of a gradient conflict; without explicit equations or proof sketches in the theoretical section showing why standard joint optimization fails, it is difficult to assess whether the conflict is fundamental or resolvable by reweighting.

minor comments (3)

[Method] Clarify the exact implementation of the decoupling in DCPO, including the modified loss terms and any additional hyperparameters introduced.
[Experiments] Include ablation studies on the impact of the decoupling on other model behaviors beyond accuracy and calibration, such as response length or diversity.
[Experiments] Ensure the experimental setup details (e.g., datasets, model sizes, training steps) are fully specified to support reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The feedback on the theoretical analysis is constructive, and we address it directly below.

Referee: [Theoretical Analysis] The central claim rests on the theoretical demonstration of a gradient conflict; without explicit equations or proof sketches in the theoretical section showing why standard joint optimization fails, it is difficult to assess whether the conflict is fundamental or resolvable by reweighting.

Authors: We appreciate this observation. Section 3.2 of the manuscript derives the gradient conflict by contrasting the policy gradient term for accuracy maximization (which increases probability mass on correct tokens) against the calibration penalty term (which reduces overconfidence on incorrect answers). The analysis shows that these gradients oppose each other under the verifiable-reward setting, leading to a fundamental tension rather than a simple weighting issue. To strengthen clarity, we will revise the section to include the explicit gradient expressions for both objectives and a short proof sketch demonstrating that no fixed reweighting can eliminate the directional conflict. This addition will make the argument self-contained while preserving the original conclusions.

Revision: yes
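The reweighting point can be illustrated with a generic first-order argument. This is not the paper's derivation: the symbols g_a, g_c, α, β, and η are introduced here for illustration, and the conflict itself (a negative inner product) is taken as a premise rather than proved.

```latex
% Illustrative symbols only; none of this notation is taken from the paper.
% g_a: ascent direction for accuracy, g_c: descent direction for calibration error.
\begin{aligned}
g_a &= \nabla_\theta J_{\mathrm{acc}}(\theta), \qquad
g_c = -\nabla_\theta L_{\mathrm{cal}}(\theta), \qquad
\langle g_a, g_c \rangle < 0 \quad \text{(assumed conflict)} \\
d &= \alpha\, g_a + \beta\, g_c, \qquad \alpha, \beta > 0 \\
J_{\mathrm{acc}}(\theta + \eta d) - J_{\mathrm{acc}}(\theta)
  &\approx \eta \langle g_a, d \rangle
  = \eta \left( \alpha \lVert g_a \rVert^2 + \beta \langle g_a, g_c \rangle \right)
  < \eta\, \alpha \lVert g_a \rVert^2
\end{aligned}
```

Under that premise, any positive weight β on the calibration term lowers the first-order accuracy gain of a joint step relative to the accuracy-only step, and rescaling by positive α and β leaves the sign of the inner product unchanged.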

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation begins with a theoretical demonstration of gradient conflict between accuracy maximization and calibration minimization, which is presented as an independent analysis rather than a redefinition of terms or a fit to its own outputs. DCPO is then introduced as a decoupling framework motivated by this conflict, with empirical validation against external baselines such as GRPO showing preserved accuracy and improved calibration metrics. No load-bearing step reduces by construction to self-citation chains, ansatz smuggling, or renaming of known results; the central claims remain self-contained against external benchmarks and falsifiable comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities are identifiable from the provided text.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.