On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
Pith reviewed 2026-05-13 22:04 UTC · model grok-4.3
The pith
SFT on long chain-of-thought data with divergent branching patterns achieves lower training loss yet worse generalization on reasoning benchmarks than data with convergent deductive patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SFT on DeepSeek-R1-0528 long CoT data achieves markedly lower training loss, yet significantly worse generalization on reasoning benchmarks, than SFT on gpt-oss-120b data for identical problems. The paper attributes the gap to reasoning patterns: DeepSeek trajectories are divergent and branch-heavy, so models trained on them pursue redundant exploratory branches instead of reaching correct solutions, whereas gpt-oss trajectories are highly convergent and deductive. Filtering out frequently branching trajectories from the DeepSeek-R1-0528 data improves reasoning performance by up to 5.1 percent on AIME25 and by 3.6 percent on average across five benchmarks.
What carries the argument
Reasoning pattern difference between divergent branch-heavy trajectories and convergent deductive ones, which determines whether the fine-tuned model inherits inefficient exploration behaviors during inference.
If this is right
- Models trained on divergent branch-heavy trajectories inherit redundant exploration that reduces their ability to reach correct solutions on reasoning tasks.
- Filtering trajectories with frequent branching from divergent sources improves generalization on math reasoning benchmarks such as AIME25.
- Training loss in long CoT SFT is not a reliable predictor of out-of-distribution generalization performance.
- Convergent deductive reasoning patterns in training data produce models that solve problems more reliably without getting trapped in unproductive branches.
Where Pith is reading between the lines
- Data curation that prioritizes convergent deductive paths over exploratory branches could become a standard step when building reasoning models from synthetic trajectories.
- The same branching filter might close generalization gaps in other long CoT training pipelines that use data from different generators.
- Extending the analysis to non-math domains such as code generation or scientific reasoning would test whether the pattern effect is domain-specific.
- If reinforcement learning stages also inherit exploration styles from SFT data, applying similar filters before RL could compound the gains.
Keywords: chain-of-thought, supervised fine-tuning, reasoning patterns
Load-bearing premise
The observed generalization gap between the two data sources is caused by their difference in reasoning patterns rather than other uncontrolled differences in trajectory quality or model-specific generation artifacts.
What would settle it
Training models on DeepSeek-R1-0528 subsets matched to gpt-oss-120b on trajectory length, verification score, and problem distribution but differing only in branching frequency, then checking whether the generalization gap closes, would test the claim directly.
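One way to build such a comparison is to bucket trajectories by length and, within each bucket, take equally sized low-branching and high-branching groups. The sketch below is illustrative only: the `Trajectory` fields, the `branch_ratio` score, and the `bin_width` value are assumptions made for this example, not quantities defined in the paper, and the sketch matches only length, not verification score or problem distribution.

```python
from collections import defaultdict
from typing import NamedTuple


class Trajectory(NamedTuple):
    problem_id: str
    num_tokens: int       # trajectory length in tokens
    branch_ratio: float   # hypothetical branching-frequency score in [0, 1]


def length_matched_split(trajs, bin_width=2000):
    """Return (low, high) branching groups with equal counts per length bin."""
    # Group trajectories into fixed-width length bins.
    bins = defaultdict(list)
    for t in trajs:
        bins[t.num_tokens // bin_width].append(t)

    low, high = [], []
    for _, members in sorted(bins.items()):
        # Within a bin, order by branching score and take both ends.
        members = sorted(members, key=lambda t: t.branch_ratio)
        half = len(members) // 2
        if half == 0:
            continue  # a singleton bin cannot contribute to both groups
        low.extend(members[:half])
        high.extend(members[-half:])
    return low, high
```

Because both groups draw the same number of trajectories from every length bin, their length histograms agree at the chosen bin resolution, so a remaining performance gap is harder to attribute to length alone.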
Original abstract
Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SFT on long CoT trajectories from DeepSeek-R1-0528 yields lower training loss but significantly worse generalization on reasoning benchmarks than trajectories from gpt-oss-120b (same problems), due to divergent branch-heavy exploration versus convergent deductive patterns. Filtering frequently branching trajectories from the DeepSeek data is proposed as a remedy, yielding gains of up to 5.1% on AIME25 and 3.6% average across five benchmarks.
Significance. If the causal attribution to reasoning patterns holds after controlling for confounds, the work is significant for highlighting how specific trajectory behaviors in long-CoT data affect generalization in reasoning models. The empirical comparison and simple filtering method could inform data curation practices for SFT of large reasoning models.
major comments (2)
- [Results section / filtering experiment] The filtering experiment (described in the abstract and results) does not control for trajectory length or step count. Branching frequency is likely correlated with longer exploratory traces; without reporting length statistics for the full vs. filtered sets or an ablation that holds length fixed while varying only branching, the reported gains (5.1% AIME25, 3.6% average) cannot be isolated to removal of branching behavior.
- [Experimental setup and abstract] The central paradox claim lacks details on statistical significance, exact benchmark splits, and controls for trajectory correctness rate or length distribution (abstract and experimental setup). These omissions leave the generalization discrepancy only partially supported and weaken attribution to reasoning patterns versus other uncontrolled differences.
minor comments (2)
- [Abstract] Clarify the exact five benchmarks used for the 3.6% average improvement and list them explicitly.
- [Analysis section] Provide quantitative definitions or metrics for 'frequently branching' and 'convergent deductive' trajectories to improve reproducibility of the pattern analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will incorporate clarifications and additional analyses in the revised manuscript to better support our claims.
Referee: [Results section / filtering experiment] The filtering experiment (described in the abstract and results) does not control for trajectory length or step count. Branching frequency is likely correlated with longer exploratory traces; without reporting length statistics for the full vs. filtered sets or an ablation that holds length fixed while varying only branching, the reported gains (5.1% AIME25, 3.6% average) cannot be isolated to removal of branching behavior.
Authors: We thank the referee for highlighting this potential confound. Branching frequency can indeed correlate with trajectory length. In the revised manuscript, we will report average lengths and step counts for the full DeepSeek-R1-0528 set versus the filtered subset. We will also add an ablation that subsamples trajectories to match length distributions while varying branching frequency, allowing us to better isolate the contribution of divergent patterns to the observed generalization gains. revision: yes
Referee: [Experimental setup and abstract] The central paradox claim lacks details on statistical significance, exact benchmark splits, and controls for trajectory correctness rate or length distribution (abstract and experimental setup). These omissions leave the generalization discrepancy only partially supported and weaken attribution to reasoning patterns versus other uncontrolled differences.
Authors: We agree that these details strengthen the presentation. All trajectories from both sources are verified correct, yielding a 100% correctness rate. The benchmarks employ standard community splits (official AIME25 test set and equivalent for the remaining four). In revision, we will explicitly document the splits and correctness control in the experimental setup, report length distributions, and add statistical significance measures (e.g., bootstrap confidence intervals or paired tests) for the performance differences. These additions will more firmly attribute the discrepancy to reasoning patterns. revision: yes
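The bootstrap confidence intervals mentioned in the response could be computed along the following lines. This is a minimal sketch, assuming per-problem 0/1 correctness scores for two models on the same benchmark problems; the function name, the default resample count, and the percentile method are choices made for this example, not details taken from the paper.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on shared problems.

    correct_a / correct_b: sequences of 0/1 scores, aligned by problem.
    Returns (point_estimate, ci_low, ci_high).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same problems")
    n = len(correct_a)
    if n == 0:
        raise ValueError("need at least one problem")

    # Per-problem differences keep the pairing intact under resampling.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    rng = random.Random(seed)

    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    ci_low = means[int((alpha / 2) * n_resamples)]
    ci_high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, ci_low, ci_high
```

Resampling problems rather than individual scores respects the fact that both models are evaluated on the same items. On small benchmarks such as a single AIME year, the resulting interval is expected to be wide.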
Circularity Check
No significant circularity in empirical comparison
The paper is a purely empirical study comparing SFT on verified CoT trajectories from two models (DeepSeek-R1-0528 and gpt-oss-120b) on identical problem sets. It reports lower training loss yet worse generalization for one source, attributes this to observed differences in reasoning patterns (divergent branching vs convergent deductive), and shows post-hoc filtering of branching trajectories yields gains on benchmarks. No equations, derivations, fitted parameters, or self-citations reduce any claim to its inputs by construction. The filtering rule is an empirical remedy, not a prediction derived from the same data. The central claims rest on direct experimental measurements and are self-contained against external benchmarks.
Reference graph
Works this paper leans on
[5] and so on. For each problem, we query both DeepSeek-R1-0528 and gpt-oss-120b to generate their respective Long CoT trajectories. To rigorously control data quality, we apply a rule-based verification pipeline to ensure that all trajectories used in our experiments successfully arrive at the correct final answer. This step guarantees that any observed per...
[6] Propose (Hypothesis & Exploration)
- Definition: The model is exploring a new idea, setting up a hypothesis, or suggesting an alternative path. It represents the "divergent" phase of reasoning.
- Significance: High frequency of this label indicates a highly exploratory, tree-like search structure.
[7] Deduce (Sequential Deduction)
- Definition: The model is executing a mathematical operation or making a direct logical inference based on the *immediately preceding* established facts or the current hypothesis. It represents the "convergent, linear" phase.
- Significance: Continuous sequences of this label indicate a dense, high-dependency deductive chain.
[8] Verify (Self-Reflection & Checking)
- Definition: The model pauses its forward progression to double-check an intermediate calculation, verify a condition, or assess if the current path makes sense without yet abandoning it.
[9] Backtrack (Error Correction & Path Abandonment)
- Definition: The model explicitly realizes an error or a dead end, rejects the current reasoning branch, and retreats to a previous state or prepares to start over.
Critical Rules For Annotation:
[10] Focus STRICTLY on the ACTION in [CURRENT STEP]. The [PREVIOUS STEP] is strictly for context. Do NOT assign a label based on the tone or action of the [PREVIOUS STEP].
[11] The "Pivot vs. Progress" Test (crucial for distinguishing Propose vs. Deduce).
- Progress (Label → Deduce): If [CURRENT STEP] simply executes the math, unpacks the logic, or states the direct consequence of the [PREVIOUS STEP] (e.g., solving the equation just proposed), it is making forward progress. *Key signs: "Thus", "So", "Which means", "That would chan...
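Given step-level labels from the four-way taxonomy above, a branching-frequency filter can be sketched as follows. The label names come from the annotation prompt; everything else is an assumption for illustration: the excerpt does not state how the paper scores "frequently branching" trajectories, so the score here is simply the share of steps labeled Propose, and the `max_ratio` threshold is a hypothetical value.

```python
from collections import Counter

# Label names as given in the annotation prompt.
LABELS = ("Propose", "Deduce", "Verify", "Backtrack")


def propose_ratio(step_labels):
    """Fraction of steps labeled Propose; 0.0 for an empty trajectory."""
    unknown = set(step_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not step_labels:
        return 0.0
    return Counter(step_labels)["Propose"] / len(step_labels)


def filter_branch_heavy(trajectories, max_ratio=0.3):
    """Keep trajectories whose Propose ratio is at most max_ratio.

    trajectories: dict mapping a trajectory id to its list of step labels.
    max_ratio: hypothetical threshold, not a value from the paper.
    """
    return {
        tid: labels
        for tid, labels in trajectories.items()
        if propose_ratio(labels) <= max_ratio
    }
```

A variant that also counts Backtrack steps, or that ranks trajectories and drops a fixed top fraction instead of thresholding, would fit the same interface.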