Why Do Reasoning Models Lose Coverage? The Role of Data and Forks in the Road

Chandan K Reddy; Khoa D Doan; Nan Zhang; Ngoc-Hieu Nguyen; Parshin Shojaee; Phuc Minh Nguyen; Rui Zhang

arxiv: 2605.17026 · v1 · pith:UDLFAG2Bnew · submitted 2026-05-16 · 💻 cs.LG

Pith reviewed 2026-05-19 20:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords reasoning models · coverage shrinkage · supervised fine-tuning · decision points · forks in the road · pass@k · data synthesis · diversity decoding

The pith

Fine-tuning data with ambiguous decision points causes reasoning models to lose coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning models gain single-answer accuracy after supervised fine-tuning yet lose the ability to find correct solutions when multiple attempts are allowed. The paper traces this coverage shrinkage to the fine-tuning data itself, specifically the frequent appearance of decision points where several reasoning paths are valid but the patterns are hard to distinguish. Controlled experiments that insert such points into graph structures and reasoning tasks show the shrinkage scales directly with how often these forks occur. The authors also find that redesigning the data to handle decision points and adding decoding steps that favor variety can recover some of the lost coverage.
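The gap between single-answer accuracy and multi-attempt success is usually measured with pass@k. As a minimal sketch, the block below uses the standard unbiased pass@k estimator (n samples per problem, c of them correct); the paper's exact evaluation code is not shown here, and the helper `coverage_shrinkage` is a hypothetical name introduced for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage_shrinkage(base_counts, tuned_counts, n: int, k: int) -> float:
    """Hypothetical helper: mean pass@k of the base model minus that of the
    fine-tuned model, given per-problem correct counts on the same problems."""
    base = sum(pass_at_k(n, c, k) for c in base_counts) / len(base_counts)
    tuned = sum(pass_at_k(n, c, k) for c in tuned_counts) / len(tuned_counts)
    return base - tuned
```

A positive return value from `coverage_shrinkage` at large k, alongside equal or better pass@1, is the pattern the summary above describes.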

Core claim

The coverage shrinkage in reasoning models after SFT arises because the fine-tuning data contains many decision-point, or forks-in-the-road, scenarios: indecipherable patterns at which multiple valid reasoning paths exist. These lead the model to narrow its exploration to preferred paths. The link is established as a correlation through controlled case studies on graph branching and reasoning modes, and the effect can be partially reversed by targeted data synthesis that explicitly designs for decision points, together with diversity-encouraging decoding.

What carries the argument

Decision-point or 'forks in the road' scenarios in the fine-tuning data, which create situations with multiple valid reasoning paths at ambiguous nodes and drive the model to overfit to specific paths.
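A toy calculation, not taken from the paper, can make this mechanism concrete. It assumes each problem has a single two-way fork with exactly one correct branch, and that the model places probability `confidence` on a branch that turns out to be correct for only half of the problems. Both function names are hypothetical.

```python
def pass_at_k_at_fork(p_correct: float, k: int) -> float:
    """Chance that at least one of k independent samples takes the correct branch."""
    return 1.0 - (1.0 - p_correct) ** k

def mean_pass_at_k(confidence: float, k: int) -> float:
    """Average pass@k over two equally likely problem types (toy assumption):
    the favoured branch is correct, or the favoured branch is wrong."""
    favoured_is_right = pass_at_k_at_fork(confidence, k)
    favoured_is_wrong = pass_at_k_at_fork(1.0 - confidence, k)
    return 0.5 * (favoured_is_right + favoured_is_wrong)
```

Under these assumptions pass@1 stays at 0.5 whatever the confidence, while pass@8 falls from about 0.996 for an undecided model (`confidence=0.5`) to roughly 0.54 for a near-deterministic one (`confidence=0.99`). Committing hard to one branch leaves single-attempt accuracy unchanged but removes most of the benefit of extra attempts.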

If this is right

Higher prevalence of decision-point scenarios in the training data produces greater coverage shrinkage.
Targeted data synthesis that explicitly designs for decision points can partially mitigate shrinkage.
Diversity-encouraging decoding mechanisms provide an additional lever to limit loss of coverage.
Data-centric properties, rather than the fine-tuning algorithm alone, are a primary driver of shrinkage under SFT.
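One way to picture a diversity-encouraging decoding step is to spread the sampling budget across several likely opening tokens instead of letting every sample start from the single most probable one. The sketch below is an assumption about how such prefix sampling could look, not the paper's implementation; `continue_fn` is a hypothetical stand-in for whatever routine completes a response from a given first token.

```python
from typing import Callable

def top_k_prefixes(first_token_probs: dict[str, float], k: int) -> list[str]:
    """Return the k most probable first tokens, most probable first."""
    ranked = sorted(first_token_probs, key=first_token_probs.get, reverse=True)
    return ranked[:k]

def sample_with_prefixes(
    first_token_probs: dict[str, float],
    k: int,
    num_samples: int,
    continue_fn: Callable[[str], str],
) -> list[str]:
    """Generate num_samples responses, cycling through the top-k first tokens."""
    prefixes = top_k_prefixes(first_token_probs, k)
    outputs = []
    for i in range(num_samples):
        # Round-robin over prefixes so every retained opening token is used.
        prefix = prefixes[i % len(prefixes)]
        outputs.append(continue_fn(prefix))
    return outputs
```

Cycling through prefixes is a deliberate simplification: it guarantees that low-probability openings are explored, which plain sampling from a sharply peaked distribution rarely does.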

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Preserving broad coverage may require training data that deliberately retains exposure to ambiguous choice points rather than eliminating them.
The same narrowing effect could appear under other post-training regimes such as reinforcement learning if they also reward single-path consistency.
Quantifying the fraction of forks across standard reasoning datasets could allow prediction of shrinkage before full training runs.
Dataset construction rules that minimize indecipherable patterns at branch points might improve multi-solution generalization without sacrificing accuracy.
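Quantifying forks presupposes an operational definition of a decision point. As a rough sketch under a strong simplifying assumption (a decision point is any non-leaf node with more than one distinct successor, ignoring whether the choice is actually indecipherable), a fork fraction for a graph-structured problem could be computed as follows. The function name and the adjacency-dict representation are hypothetical.

```python
def fork_fraction(graph: dict[str, list[str]]) -> float:
    """Fraction of non-leaf nodes that have more than one distinct successor."""
    internal = [node for node, successors in graph.items() if successors]
    if not internal:
        return 0.0
    forks = [node for node in internal if len(set(graph[node])) > 1]
    return len(forks) / len(internal)

# Example: a root with two branches of length two (a small star graph).
star = {
    "root": ["a1", "b1"],
    "a1": ["a2"],
    "b1": ["b2"],
    "a2": [],
    "b2": [],
}
```

For real reasoning traces the hard part is deciding what counts as a node and as a valid alternative path; this sketch only covers the counting step once that structure is given.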

Load-bearing premise

The controlled case studies using graph branching and reasoning modes accurately capture the decision-point dynamics present in real fine-tuning datasets for reasoning models.

What would settle it

A dataset engineered with zero decision points that still produces measurable coverage shrinkage after fine-tuning, or a dataset rich in decision points that produces no shrinkage, would falsify the claimed driver.

Figures

Figures reproduced from arXiv: 2605.17026 by Chandan K Reddy, Khoa D Doan, Nan Zhang, Ngoc-Hieu Nguyen, Parshin Shojaee, Phuc Minh Nguyen, Rui Zhang.

**Figure 1.** Reasoning unfolds through forks in the road, choices made without knowing the path to truth. In this paper, we argue that data plays a central and underexplored role in driving the coverage shrinkage behavior. Instead of focusing on post-training algorithms, we shift attention to the structure and properties of the post-training data, and ask a simple but critical question: what aspects of reasoning data e…

**Figure 2.** Illustrative examples of forks in the road case studies. (a) Graph navigation with indecipherable nodes; and (b) Mathematical reasoning with multiple valid solution modes. In both settings, decision points force commitment to a path without knowing which will succeed. … effectively operate in both instruct (or non-thinking) and backtracking/thinking reasoning modes remains a challenging open problem (Wang et…

**Figure 3.** Effect of decision points on coverage in graph navigation task. Pass@k across SFT epochs for Forward vs. Reverse (w/o DP) problem solving settings.

**Figure 4.** Left: Change in model confidence at decision points over the course of SFT; Right: At the last epoch, the model assigns high confidence to both correct and incorrect paths, indicating uncalibrated decisions that drive coverage collapse. We also looked deeper into Forward model behavior at decision points (…

**Figure 5.** Model confidence at decision points across prompt variations with identical semantics per problem. Learning Spurious Cues for Branch Selection. Previous work shows that reasoning in language models is highly sensitive to minor input changes (Mirzadeh et al., 2025; Jiang et al., 2024b). We further examine this by probing the model's branch selection: we perturb prompts by shuffling variable dependencies (e…

**Figure 6.** Diversity structure in SFT data. Data-level diversity distributes modes across problems, while problem-level diversity exposes multiple reasoning modes within each problem. 4.2 Reasoning Mode Selection. The forks-in-the-road phenomenon is not limited to synthetic graph navigation settings; it also arises naturally in real-world reasoning tasks where multiple solution strategies coexist. During generation, t…

**Figure 7.** Effect of data diversity design on coverage in reasoning mode selection task. Pass@k across SFT epochs for data-level vs. problem-level data diversity. [Plot residue: panels for Qwen2.5-0.5B and EvoLM-1B; x-axis P(Code Reasoning); y-axis # of Questions; legend Epoch 1, 2, 4, 8.]

**Figure 8.** Reasoning mode preference under different diversity designs. Distribution of per problem code vs. NL reasoning across SFT epochs for data-level vs. problem-level diversity. … is distributed. This suggests that coverage is not determined solely by how much diversity is present; it depends critically on how that diversity is also structured between problems. To better understand this difference, we also analyz…

**Figure 9.** Illustrative examples of linear vs. backtracking reasoning modes. 4.2.2 Linear vs. Backtracking Reasoning. Another important mode of variation in reasoning is the structure of the reasoning process itself. We focus on two common modes: linear thinking and backtracking thinking. In linear thinking, the model follows a direct, forward chain of steps without revisiting earlier decisions (left). In contrast, ba…

**Figure 10.** Impact of prefix token manipulation on reasoning behavior. Small changes only in one of the initial tokens (e.g., "Okay," "Let," "To") lead to considerably large variations in response length and accuracy across different benchmark datasets (GSM8K, Math500, AIME 2024/2025). These results suggest that reasoning structure itself behaves like a fork in the road: the model must commit to either linear or bac…

**Figure 11.** Models' default behavior shows over-thinking on factual QA (backtracking hurts) and under-thinking on counterfactual arithmetic (backtracking helps). Over-thinking and Under-thinking. We also perform a fine-grained analysis showing that strategy selection in distilled reasoning models correlates with irrelevant lexical features rather than the problem structure. We evaluate this on (1) simple knowledg…

**Figure 12.** Recovering coverage via prefix perturbation. Top-k prefix sampling (Top-8) mitigates coverage shrinkage and improves pass@k at larger k.

**Figure 13.** Pass@k performance when running GRPO on models pretrained on forward and …

**Figure 14.** Performance and Response Length across different prompt prefixes. The accuracy of three models (DS-Qwen-1.5B, DS-Qwen-7B, and DS-LLaMA-8B) across mathematical benchmarks (GSM8K, MATH500, AIME24, and AIME25). The data demonstrates a high degree of performance variance (up to 20%) and response length variance (up to 5.8x) depending entirely on the starting word or phrase of the thinking prefix. Next, I'll add …

Original abstract

Recent progress in large language models has led to the emergence of reasoning models, which have shown strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior works have observed that they show a coverage shrinkage behavior, where pass@k degrades relative to the base model. In this paper, we investigate how this shrinkage arises under SFT-based post-training. We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically related to decision points or "forks in the road" scenarios where the model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage behavior can be partially mitigated through targeted data synthesis design of decision points and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ties coverage shrinkage after SFT to decision-point density using synthetic graph and mode case studies, with partial mitigation via data synthesis and diversity decoding.

The core observation is that reasoning models lose coverage under standard SFT because the fine-tuning data contains too many ambiguous decision points with multiple valid paths. The authors test this by building controlled proxies (graph branching with indecipherable nodes and separate reasoning-mode setups) and tracking how pass@k drops as the fraction of these points rises. The correlation comes through cleanly in the post-training dynamics, and they show that deliberately designing data to reduce or clarify those forks, plus a diversity-aware decoder, can recover some of the lost coverage.

Referee Report

2 major / 2 minor

Summary. The paper investigates coverage shrinkage (degradation in pass@k relative to the base model) in reasoning models after SFT-based post-training. It hypothesizes that this is driven by properties of the fine-tuning data, specifically 'forks in the road' or decision points where the model encounters indecipherable patterns with multiple valid reasoning paths. The authors test this via controlled case studies simulating such settings through indecipherable nodes in graph branching and reasoning modes, report a tight correlation between shrinkage and decision-point prevalence, and demonstrate partial mitigation through targeted data synthesis and diversity-encouraging decoding.

Significance. If the synthetic proxies accurately reflect real SFT data statistics, the work offers a data-centric explanation for a documented limitation in reasoning models and identifies actionable design levers (data synthesis and decoding) for mitigation. The controlled case-study approach is a methodological strength for isolating causal factors, though the overall significance hinges on generalization beyond the proxies.

major comments (2)

[§4 (Controlled Case Studies on Graph Branching and Reasoning Modes)] The central hypothesis requires that the synthetic graph-branching and reasoning-mode case studies reproduce the frequency, ambiguity structure, and path-validity distribution of decision points in actual fine-tuning corpora for reasoning models. The manuscript does not provide any matching statistics or validation that the controlled distributions align with real SFT datasets; without this, the observed correlation in the proxies does not establish the claimed driver for real-model shrinkage.
[Results and Experiments sections] The results on correlation and mitigation lack reported quantitative details such as dataset sizes, number of runs, error bars, or statistical significance tests. This undermines evaluation of whether the 'tight correlation' and 'partial mitigation' are robust enough to support the causal claim.

minor comments (2)

[Introduction] Define 'coverage shrinkage' and the precise metrics (pass@1 vs. pass@k) with equations or explicit formulas in the introduction or preliminaries to avoid ambiguity for readers unfamiliar with the prior literature.
[Abstract] The abstract would benefit from at least one concrete quantitative result (e.g., a reported correlation coefficient or mitigation improvement percentage) to allow immediate assessment of effect size.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the insightful comments, which help us clarify the scope and strengthen the empirical support of our work. Below we provide point-by-point responses to the major comments.

Referee: [§4 (Controlled Case Studies on Graph Branching and Reasoning Modes)] The central hypothesis requires that the synthetic graph-branching and reasoning-mode case studies reproduce the frequency, ambiguity structure, and path-validity distribution of decision points in actual fine-tuning corpora for reasoning models. The manuscript does not provide any matching statistics or validation that the controlled distributions align with real SFT datasets; without this, the observed correlation in the proxies does not establish the claimed driver for real-model shrinkage.

Authors: We agree that validating the proxy against real data would strengthen the claims. However, since many SFT datasets for reasoning models are proprietary, direct matching is not always feasible. Our case studies are designed to capture the essential features of forks in the road—indecipherable patterns with multiple valid paths—by controlling the prevalence and structure in a way that mirrors the hypothesized mechanism. The tight correlation observed supports the data-centric explanation. In revision, we will add comparisons to statistics derivable from public datasets (e.g., number of solution paths in MATH problems) to better align the proxies. This addresses the concern without overclaiming generalization. revision: partial
Referee: [Results and Experiments sections] The results on correlation and mitigation lack reported quantitative details such as dataset sizes, number of runs, error bars, or statistical significance tests. This undermines evaluation of whether the 'tight correlation' and 'partial mitigation' are robust enough to support the causal claim.

Authors: This is a valid criticism. We will revise the Results and Experiments sections to include all requested details: specific dataset sizes for each case study, the number of runs (we used 5 independent runs per condition), error bars representing standard deviation, and statistical significance tests including p-values for the reported correlations and mitigation improvements. revision: yes

standing simulated objections not resolved

We cannot provide exact matching statistics from the proprietary fine-tuning data used in commercial reasoning models, as these are not publicly available.

Circularity Check

0 steps flagged

No significant circularity; hypothesis tested via independent controlled simulations

The paper advances a hypothesis that coverage shrinkage arises from decision-point ('forks in the road') properties in fine-tuning data and tests it by constructing separate controlled case studies on graph branching and reasoning modes. These simulations track post-SFT dynamics and report observed correlations with decision-point prevalence; the case studies are not defined in terms of the target shrinkage metric, nor is any fitted parameter or self-citation chain used to derive the central claim. No equation or result reduces by construction to the inputs, and the experimental design remains externally falsifiable against real datasets. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or additional axioms are stated beyond the core hypothesis.

axioms (1)

domain assumption: Shrinkage behavior arises under SFT-based post-training and is driven by data properties at decision points
This is the central hypothesis tested through controlled case studies.

pith-pipeline@v0.9.0 · 5767 in / 1208 out tokens · 55903 ms · 2026-05-19T20:28:40.256953+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically related to decision points or 'forks in the road' scenarios where model faces indecipherable patterns with multiple valid reasoning paths."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "coverage shrinkage behavior, where pass@k degrades relative to the base model"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
