arxiv: 2604.22074 · v1 · submitted 2026-04-23 · 💻 cs.CL

Recognition: unknown

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

Qinan Yu , Alexa Tartaglini , Peter Hase , Carlos Guestrin , Christopher Potts

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords reinforcement learningverifiable rewardschain of thoughtreasoning metricslanguage model post-trainingcausal importancesufficiency of reasoningQwen2.5

0 comments

The pith

Reinforcement learning from verifiable rewards improves model accuracy without necessarily making the reasoning chain causally important or sufficient for the answer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the common view that RLVR on chain-of-thought reasoning trains models to use their reasoning steps meaningfully. It introduces two metrics to check this: one for how much the reasoning affects the final answer and one for whether the reasoning alone lets a verifier get the right answer. Experiments show that while accuracy goes up with RLVR, these metrics often do not, raising doubts about what the models are actually learning. Simple additions like a bit of supervised training or extra rewards based on the new metrics can fix this without hurting accuracy.

Core claim

The central finding is that RLVR does not reliably improve the causal importance of reasoning or its sufficiency for verification, despite boosting task accuracy. This holds for the Qwen2.5 series on ReasoningGym tasks. Remedies include a small SFT prefix before RLVR or combining outcome rewards with auxiliary rewards that target the new metrics, achieving both accuracy and better reasoning properties.

What carries the argument

Causal Importance of Reasoning and Sufficiency of Reasoning metrics, which measure the effect of reasoning tokens on the answer and the verifiability of the answer from reasoning alone.

If this is right

RLVR training may produce high-accuracy models whose reasoning does not drive their outputs.
Adding a small supervised fine-tuning step before RLVR can increase both the causal importance and sufficiency of reasoning.
Joint rewards that include terms for causal importance and sufficiency can match pure RLVR accuracy while ensuring reasoning is important and sufficient.
Post-training procedures need to account for reasoning quality beyond just final answer correctness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If these results generalize, training pipelines should incorporate reasoning-specific rewards to ensure transparent and verifiable chains.
Applications relying on model explanations for trust or safety may need to verify reasoning importance separately from accuracy.
These metrics might be adapted to other reasoning formats beyond chain-of-thought to check similar issues.

Load-bearing premise

The assumption that measuring the impact of reasoning steps on the final answer and checking if reasoning alone can lead to the answer captures whether reasoning truly matters, and that this holds across models and tasks.

What would settle it

An experiment where the model's answer changes substantially when the reasoning chain is removed or masked after RLVR training, or where an independent verifier cannot recover the correct answer from the reasoning text alone despite high task accuracy.

Figures

Figures reproduced from arXiv: 2604.22074 by Alexa Tartaglini, Carlos Guestrin, Christopher Potts, Peter Hase, Qinan Yu.

**Figure 1.** Figure 1: Overview of our two metrics. (a) Causal Importance of Reasoning (CIR). (b) Sufficiency of Reasoning (SR). uation and training loops where correctness is externally checkable rather than inferred from surface plausibility (Stojanovski et al., 2025). Our Sufficiency of Reasoning metric is inspired by work on reasoning legibility (Kirchner et al., 2024). While Kirchner et al. (2024) also train models to prod… view at source ↗

**Figure 2.** Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative examples illustrating task-dependent CIR/SR. On tasks like advanced geometry, RLVR tends to produce concrete intermediate computations that are both causally used (higher CIR) and externally checkable (higher SR). On algorithmic manipulation tasks like manipulate matrix, accuracy can improve while chains collapse into high-level or incomplete plans, reducing both CIR and SR. In Appendix E, we s… view at source ↗

**Figure 5.** Figure 5: Correlation analysis between improvements in accuracy, CIR, and SR. Improvements in CIR and accuracy are not significantly correlated (Spearman ρ = 0.17; p = 0.31), whereas improvements in SR and accuracy are correlated (ρ = 0.57; p = 0.0001). On the other hand, SR and CIR decrease when the accuracy improvement is low (∆Acc < 0.5). To further validate this hypothesis, we trained separate models to learn … view at source ↗

**Figure 6.** Figure 6: SFT-before-RL effect on CIR and SR, across different supervised data sizes: Here n denotes the number of expert trajectories used for SFT before RL training. The figure tracks Accuracy, CIR, and SR during the subsequent RL phase starting from the post-SFT checkpoint. With minimal SFT data, CIR and SR both increase substantially, suggesting that SFT helps produce more interpretable and verifiable reasoning … view at source ↗

**Figure 7.** Figure 7: Example of a reasoning chain after SFT + RLVR: on manipulate matrix from the same question in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: RLVR with auxiliary reward signals based on CIR and SR: The top row shows training with the standard rule-based reward on the output (Section 4) plus SR as an auxiliary reward (weighted with α), and the bottom row shows training with the rule-based reward plus CIR (weighted with β) as an auxiliary reward. We track different CIR, SR, and accuracy during training. The black lines correspond to standard RLVR … view at source ↗

**Figure 9.** Figure 9: CIR and SR for the 1.5B model. binary_alternation leg_counting shortest_path largest_island count_bits number_sorting letter_counting power_function word_sequence_reversal pool_matrix sentence_reordering futoshiki family_relationships number_format letter_jumble string_synthesis simple_equations fraction_simplification tsumego palindrome_generation decimal_arithmetic countdown mini_sudoku quantum_lock base… view at source ↗

**Figure 10.** Figure 10: CIR and SR for the 7B model. In the graphs below, we examine the relationship between our explanation metrics and downstream performance by computing Spearman rank correlations between (i) CIR and performance and (ii) SR and performance for two model scales. For the 7B model, both correlations were weak and not statistically significant (CIR vs. performance: ρ = 0.22, p = 0.26; SR vs. performance: ρ = 0.1… view at source ↗

**Figure 11.** Figure 11: CIR and SR for the Llama 3.2-3B model. 0.1 0.0 0.1 0.2 0.3 0.4 0.5 CIR 0.0 0.2 0.4 0.6 0.8 A c c 0.4 0.2 0.0 0.2 0.4 0.6 0.8 SR 0.0 0.2 0.4 0.6 0.8 A c c Correlation between CIR, SR and Acc Qwen2.5-1.5B [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Correlation between ∆ CIR, ∆ SR, and ∆ Acc for the 1.5B model. 0.2 0.0 0.2 0.4 0.6 CIR 0.0 0.2 0.4 0.6 0.8 1.0 A c c 0.4 0.2 0.0 0.2 0.4 0.6 0.8 1.0 SR 0.0 0.2 0.4 0.6 0.8 1.0 A c c Correlation between CIR, SR and Acc Qwen2.5-7B [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Correlation between ∆ CIR, ∆ SR, and ∆ Acc for the 7B model. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Lengths of reasoning traces for the Qwen 2.5-1.5B model. 0 20 40 60 80 100 120 140 160 RL Training Step 0.0 0.2 0.4 0.6 0.8 1.0 CIR CIR-reward CIR =0.0 =0.6 0 20 40 60 80 100 120 140 160 RL Training Step 0.0 0.2 0.4 0.6 0.8 1.0 SR CIR-reward SR =0.0 =0.6 0 20 40 60 80 100 120 140 160 RL Training Step 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy CIR-reward Accuracy =0.0 =0.6 [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗

**Figure 15.** Figure 15: Lengths of reasoning traces for the Qwen 2.5-1.5B model. improves both CIR and SR compared to the N = 0 setting, and these gains are largely preserved or further improved over RLVR training steps. We also train with CIR as an auxiliary reward during RLVR ( [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗

**Figure 16.** Figure 16: Lengths of reasoning traces for the Qwen 2.5-1.5B model. The length of the reasoning chain is positively correlated with both and CIR and SR. CIR and length have a correlation of 0.7,and SR and length have a correlation of 0.59 with p ≤ 0.001. futoshiki manipulate_matrix string_manipulation string_synthesis pool_matrix binary_matrix mini_sudoku binary_alternation spiral_matrix tsumego sentence_reordering … view at source ↗

**Figure 17.** Figure 17: Lengths of reasoning traces for the Qwen 2.5-3B model. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗

**Figure 18.** Figure 18: Lengths of reasoning traces for the Qwen 2.5-7B model. letter_jumble futoshiki string_splitting mini_sudoku puzzle24 largest_island string_manipulation countdown shortest_path word_sorting modulo_grid pool_matrix family_relationships number_format number_sorting fraction_simplification power_function palindrome_generation count_bits letter_counting bitwise_arithmetic binary_alternation tsumego caesar_ciph… view at source ↗

**Figure 19.** Figure 19: Lengths of reasoning traces for the Llama 3.2-3B model. When the reasoning traces have length close to 0, reasoning traces have low CIR and SR. For example, the task futoshiki has the reasoning trace of “reasoning process here” at the end of training. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗

**Figure 20.** Figure 20: Comparison of reasoning trace quality for SR and CIR using gpt-4o-mini as an evaluator. We assess three properties of the traces: concrete intermediate steps, explicit calculations, and lexically rich reasoning language. In both SR and CIR, traces with higher scores are substantially more likely to exhibit all three qualities, especially concrete steps and explicit calculations. Beyond comparing high- and… view at source ↗

**Figure 21.** Figure 21: Quality of reasoning traces across training conditions, measured by the presence of concrete intermediate steps, explicit calculations, and lexically rich reasoning language. Relative to the initial model, SFT improves all three dimensions. Adding augmented rewards through CIR and SR further improves the first two properties—concrete steps and explicit calculations—but leads to a decline in lexical richne… view at source ↗

**Figure 22.** Figure 22: gpt-4o-mini and gpt-4.1-mini evaluation of SR on Qwen-2.5-3B’s reasoning chains 23 [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗

**Figure 23.** Figure 23: Additional seeds with CIR as augmented reward (β = 0.8) 2Due to the high API cost, we were not able to run multiple seeds for SR with augmented reward. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_23.png] view at source ↗

read the original abstract

Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR) metrics to test whether RLVR on chain-of-thought actually causes models to rely on reasoning for answers. Experiments on Qwen2.5 models and ReasoningGym tasks show that RLVR improves accuracy but does not reliably raise CIR or SR; prepending a small amount of SFT or adding auxiliary CIR/SR rewards restores both metrics while preserving accuracy.

Significance. If the metrics are valid, the results challenge the common assumption that RLVR produces causally important reasoning and demonstrate simple, accuracy-preserving fixes. The work is strengthened by consistent patterns across model sizes and tasks plus explicit remedies that match baseline accuracy.

major comments (3)

[§4] §4 (Methods): The exact procedure for computing CIR (cumulative effect of reasoning tokens on the final answer) is not specified, including token masking strategy, baseline comparison, and whether gradients or counterfactuals are used. This is load-bearing because the central claim that RLVR fails to improve CIR rests on these measurements.
[§5.2, Table 2] §5.2 and Table 2: The statement that RLVR 'does not reliably improve' CIR/SR lacks reported statistical tests, confidence intervals, or effect sizes across the multiple runs and model sizes. Without these, it is unclear whether the observed flat or declining trends are distinguishable from noise.
[§3.1] §3.1: The SR metric relies on an external verifier reaching an 'unambiguous answer' from the reasoning alone; the verifier model, prompting, and decision threshold are not detailed, making it impossible to assess whether SR truly measures sufficiency independent of the original outcome reward.

minor comments (2)

[Figure 1] Figure 1 caption and §5.1: The y-axis scaling and error bars are not described, making visual comparison of CIR/SR deltas across conditions difficult.
[Related work] Related work: The discussion of prior work on reasoning faithfulness (e.g., citations to chain-of-thought faithfulness studies) could be expanded to clarify how CIR/SR differ from existing perturbation-based or attention-based probes.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the recognition that our metrics and findings challenge assumptions about RLVR if valid. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [§4] §4 (Methods): The exact procedure for computing CIR (cumulative effect of reasoning tokens on the final answer) is not specified, including token masking strategy, baseline comparison, and whether gradients or counterfactuals are used. This is load-bearing because the central claim that RLVR fails to improve CIR rests on these measurements.

Authors: We agree that the CIR procedure requires more explicit detail for reproducibility. In the revised manuscript, we will expand §4 to specify: (1) token masking removes all reasoning tokens while retaining the question prefix and answer suffix; (2) the baseline is the model's direct answer probability on the question alone (ablated reasoning); and (3) CIR is computed via counterfactual forward passes measuring the drop in correct-answer log-probability, without gradients. This ablation-based approach directly quantifies causal importance. revision: yes
Referee: [§5.2, Table 2] §5.2 and Table 2: The statement that RLVR 'does not reliably improve' CIR/SR lacks reported statistical tests, confidence intervals, or effect sizes across the multiple runs and model sizes. Without these, it is unclear whether the observed flat or declining trends are distinguishable from noise.

Authors: We acknowledge this point on statistical reporting. Experiments used 3–5 random seeds per condition, with consistent flat/declining trends across Qwen2.5 sizes (1.5B–7B) and ReasoningGym tasks. In revision, we will add per-cell standard deviations to Table 2 and §5.2, plus explicit effect-size notes comparing RLVR deltas to run variance. Formal hypothesis tests were not performed, but the patterns exceed observed noise; we can include p-values if required. revision: partial
Referee: [§3.1] §3.1: The SR metric relies on an external verifier reaching an 'unambiguous answer' from the reasoning alone; the verifier model, prompting, and decision threshold are not detailed, making it impossible to assess whether SR truly measures sufficiency independent of the original outcome reward.

Authors: We will clarify the SR details in the revised §3.1. The verifier is the base (pre-RLVR) Qwen2.5 model prompted with: 'Using only the following reasoning, output the final answer. Reasoning: [chain] Answer:'. An unambiguous answer is recorded when the verifier assigns >0.85 probability to the correct token. This uses the original model to ensure independence from the RLVR outcome reward. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces CIR and SR as independent post-hoc metrics to evaluate reasoning chains from RLVR-trained models. These metrics are defined separately from the outcome-based RLVR objective and applied to model generations on ReasoningGym tasks. The central empirical claims (RLVR boosts accuracy without reliably lifting CIR/SR; SFT or auxiliary rewards can fix this) are supported by distinct measurements rather than any reduction to fitted parameters, self-referential definitions, or self-citation chains. No load-bearing step equates a prediction to its input by construction, and the analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the validity of the two new metrics as faithful measures of reasoning quality and on the representativeness of the Qwen2.5/ReasoningGym experiments. No explicit free parameters are stated in the abstract.

axioms (2)

domain assumption Reasoning tokens can be isolated and their cumulative causal effect on the final answer can be quantified via the CIR metric
This assumption is required for the definition and use of CIR to evaluate reasoning importance.
domain assumption A verifier can be used to test whether reasoning alone suffices for an unambiguous correct answer via the SR metric
This assumption underpins the SR metric and the claim that low SR indicates insufficient reasoning.

invented entities (2)

Causal Importance of Reasoning (CIR) metric no independent evidence
purpose: To quantify the cumulative causal effect of reasoning tokens on the final answer
Newly introduced in the paper to test the assumption about reasoning reliability.
Sufficiency of Reasoning (SR) metric no independent evidence
purpose: To measure whether reasoning alone allows a verifier to reach an unambiguous answer
Newly introduced in the paper to test the assumption about reasoning reliability.

pith-pipeline@v0.9.0 · 5551 in / 1714 out tokens · 44653 ms · 2026-05-09T21:04:42.690711+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs Should Not Yet Be Credited with Decision Explanation
cs.AI 2026-05 unverdicted novelty 4.0

LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.

Reference graph

Works this paper leans on

14 extracted references · cited by 1 Pith paper

[1]

Forx≥0, this is the region above the liney=x
[2]

Next, consider the inequality y≤ −|x|+ 3

Forx <0, this is the region above the liney=−x. Next, consider the inequality y≤ −|x|+ 3 . Since |x| is always non-negative, −|x| is always non-positive, and thus −|x|+ 3 ranges from -3 to 3. This line touches at x= 0, y= 3 , goes down to x=±3 where y= 0 , and is perpendicular to the x-axis at these points. The boundaries of the region we are interested i...
[3]

Therefore, the intersection point is(1.5,1.5)

Intersection ofy=xandy=−x+ 3: x=−x+ 3 =⇒2x= 3 =⇒x= 1.5,soy= 1.5. Therefore, the intersection point is(1.5,1.5)
[4]

11111000000

Intersection ofy=−xandy=−x+ 3: −x=−x+ 3 =⇒0 = 3, this is a contradiction, indicating no intersection for this segment. So, the region of intersection is between (1.5,1.5) and the points where y=−x+ 3 intersects y=x at x= 3 and x=−3 . The vertices of the region are(−3,3),(3,3), and(1.5,1.5). The shape formed is a triangle with vertices at (−3,3) , (3,3) , ...
[5]

For the pattern starting with ‘1’: - Number of ‘1’s to be swapped to ‘0’s: 5 - Number of ‘0’s to be swapped to ‘1’s: 5 - Total swaps needed = 10
[6]

Therefore, the minimum number of swaps required is -1 because both scenarios require the same number of swaps

For the pattern starting with ‘0’: - Number of ‘0’s to be swapped to ‘1’s: 5 - Number of ‘1’s to be swapped to ‘0’s: 5 - Total swaps needed = 10 The minimum number of swaps needed is the minimum of the two scenarios calculated above. Therefore, the minimum number of swaps required is -1 because both scenarios require the same number of swaps. Thus, the an...
[7]

- For the cell (1, 2) containing 1, the nearest 0 is at (2, 2)

containing 0, the distance to the nearest 0 is 0 (since it is a 0). - For the cell (1, 2) containing 1, the nearest 0 is at (2, 2). The Manhattan distance is |1−2|+|2−2|= 1 . - For the cell (2, 1) containing 1, the nearest 0 is at (2, 2). The Manhattan distance is |2−2|+|1−2|= 1. The resulting matrix is: 2 1 0 1 For each cell in the given matrix, I need t...
[8]

After the identity transformation (no change), the matrix remains the same: 7 5 9 7 8 6
[9]

Vertically mirroring the matrix: 7 7 5 8 9 6
[10]

Mirroring the matrix along the counter- diagonal: 7 5 9 7 8 6
[11]

Mirroring the matrix along the counter- diagonal again: 9 7 7 6 8 5
[12]

Mirroring the matrix along the counter- diagonal again: 7 6 9 7 5 8
[13]

Setting all elements divisible by 7 to zero: 0 6 9 0 5 8 The final matrix is: 0 6 9 0 5 8 I will perform the operations on the given matrix in order: identity transformation, ver- tically mirror the matrix, mirror the matrix along the counterdiagonal three times, and then set all elements divisible by 7 to zero. The initial matrix is: 7 5 9 7 8 6 After id...
[14]

Transpose the matrix (swap rows with columns). 2. Reverse each row of the trans- posed matrix. The transposed matrix is: 6 0 6 6 Now reverse each row: 0 6 6 6 Tsumego Black should play a move that maximizes their capturing potential. Black can capture mor stones by play in row 20, col 1 Black should capture the stone at F5. 29 Outcome-Based Rewards Do Not...

2025