
arxiv: 2604.01702 · v2 · submitted 2026-04-02 · 💻 cs.CL

On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning

Pith reviewed 2026-05-13 22:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords reasoning · generalization · trajectories · deepseek-r1-0528 · models · performance · gpt-oss-120b

The pith

SFT on long chain-of-thought data with divergent branching patterns achieves lower training loss yet worse generalization on reasoning benchmarks than data with convergent deductive patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares supervised fine-tuning on verified long chain-of-thought trajectories from two sources while holding problem sets fixed. DeepSeek-R1-0528 data produces lower training loss but poorer results on reasoning benchmarks than gpt-oss-120b data. Analysis of token-level losses and step-level behaviors shows the sources differ in reasoning patterns, with DeepSeek trajectories favoring divergent branch-heavy exploration and gpt-oss trajectories favoring convergent deductive steps. Models trained on the former inherit inefficient exploration that traps them in redundant branches. Filtering out frequently branching trajectories from the DeepSeek data raises performance by up to 5.1 percent on AIME25 and 3.6 percent on average across five benchmarks.

Core claim

SFT on DeepSeek-R1-0528 long CoT data reaches markedly lower training loss than SFT on gpt-oss-120b data for identical problems, yet generalizes significantly worse on reasoning benchmarks. The paper attributes the gap to reasoning patterns: DeepSeek trajectories are divergent and branch-heavy, so trained models pursue redundant exploratory branches instead of reaching correct solutions, whereas gpt-oss trajectories are highly convergent and deductive. Filtering frequently branching trajectories out of the DeepSeek-R1-0528 data improves reasoning performance by up to 5.1 percent on AIME25 and 3.6 percent on average across five benchmarks.

What carries the argument

Reasoning pattern difference between divergent branch-heavy trajectories and convergent deductive ones, which determines whether the fine-tuned model inherits inefficient exploration behaviors during inference.

If this is right

  • Models trained on divergent branch-heavy trajectories inherit redundant exploration that reduces their ability to reach correct solutions on reasoning tasks.
  • Filtering trajectories with frequent branching from divergent sources improves generalization on math reasoning benchmarks such as AIME25.
  • Training loss in long CoT SFT is not a reliable predictor of out-of-distribution generalization performance.
  • Convergent deductive reasoning patterns in training data produce models that solve problems more reliably without getting trapped in unproductive branches.
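The filtering remedy can be pictured with a minimal sketch. It assumes every reasoning step has already been tagged with a behavior label and treats the share of branch-opening steps as the branching frequency. The label names, the `labels` field, and the 0.3 cutoff are illustrative assumptions, not values reported by the paper.

```python
from collections import Counter
from typing import Dict, List

# Assumption: "propose" marks a step that opens a new exploratory branch.
BRANCH_LABEL = "propose"


def branching_ratio(step_labels: List[str]) -> float:
    """Fraction of steps in one trajectory that open a new branch."""
    if not step_labels:
        return 0.0
    return Counter(step_labels)[BRANCH_LABEL] / len(step_labels)


def filter_frequent_branching(
    trajectories: List[Dict], max_ratio: float = 0.3
) -> List[Dict]:
    """Keep trajectories whose branching ratio is at most max_ratio.

    max_ratio is a hypothetical threshold chosen for this example.
    """
    return [t for t in trajectories if branching_ratio(t["labels"]) <= max_ratio]


# Toy trajectories (hand-written, not taken from either dataset).
toy = [
    {"id": "a", "labels": ["propose", "deduce", "deduce", "verify"]},
    {"id": "b", "labels": ["propose", "propose", "backtrack", "propose"]},
    {"id": "c", "labels": ["deduce", "deduce", "deduce", "deduce"]},
]
kept_ids = [t["id"] for t in filter_frequent_branching(toy)]
print(kept_ids)  # ['a', 'c']
```

Trajectory `b` is dropped because three of its four steps open a branch. A real pipeline would need the paper's own definition of "frequently branching", which this page does not state.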

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data curation that prioritizes convergent deductive paths over exploratory branches could become a standard step when building reasoning models from synthetic trajectories.
  • The same branching filter might close generalization gaps in other long CoT training pipelines that use data from different generators.
  • Extending the analysis to non-math domains such as code generation or scientific reasoning would test whether the pattern effect is domain-specific.
  • If reinforcement learning stages also inherit exploration styles from SFT data, applying similar filters before RL could compound the gains.

Load-bearing premise

The observed generalization gap between the two data sources is caused by their difference in reasoning patterns rather than other uncontrolled differences in trajectory quality or model-specific generation artifacts.

What would settle it

Training models on DeepSeek-R1-0528 subsets matched to gpt-oss-120b on trajectory length, verification score, and problem distribution but differing only in branching frequency, then checking whether the generalization gap closes, would test the claim directly.
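One way to build such matched subsets is sketched below, under stated assumptions. It controls only for trajectory length by binning on token count and drawing equal numbers of low-branching and high-branching trajectories from each bin; verification score and problem distribution would need analogous matching. The field names `num_tokens` and `branch_ratio`, the bin width, and the cutoff are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def length_matched_split(
    trajectories: List[Dict],
    bin_width: int = 1000,
    ratio_cutoff: float = 0.3,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (low_branch, high_branch) subsets with equal counts per length bin."""
    rng = random.Random(seed)
    bins = defaultdict(lambda: {"low": [], "high": []})
    for t in trajectories:
        group = "low" if t["branch_ratio"] <= ratio_cutoff else "high"
        bins[t["num_tokens"] // bin_width][group].append(t)

    low, high = [], []
    for key in sorted(bins):
        # Take the same number from both groups so length histograms match.
        n = min(len(bins[key]["low"]), len(bins[key]["high"]))
        low.extend(rng.sample(bins[key]["low"], n))
        high.extend(rng.sample(bins[key]["high"], n))
    return low, high


# Toy records (hand-written for illustration).
toy = [
    {"id": 1, "num_tokens": 500, "branch_ratio": 0.1},
    {"id": 2, "num_tokens": 700, "branch_ratio": 0.5},
    {"id": 3, "num_tokens": 900, "branch_ratio": 0.6},
    {"id": 4, "num_tokens": 1500, "branch_ratio": 0.2},
    {"id": 5, "num_tokens": 2500, "branch_ratio": 0.7},
]
low, high = length_matched_split(toy)
print(len(low), len(high))  # 1 1
```

Bins that contain only one group contribute nothing, so the matched subsets can be much smaller than the original data. That trade-off is worth reporting alongside any result obtained this way.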

Figures

Figures reproduced from arXiv: 2604.01702 by Defu Lian, Gangwei Jiang, Linqi Song, Ranran Shen, Wei Wang, Xiangyu Xi, Ying Wei, Zhaoyi Li, Zhengyu Chen.

Figure 1: (a) to (d): SFT training loss comparison of different models trained on long CoT trajectories of DeepSeek-R1 and gpt-oss-120b. (e) and (f): average testing performance on five benchmarks with varying training steps and inference context length. Blue/red curves refer to experiments with gpt-oss-120b/DeepSeek-R1-generated data, respectively.

Figure 2: Token-level SFT loss analysis for the Qwen3-8B model. (a) show the token-level …

Figure 3: Reasoning behavior distributions (a, c) and transition matrices (b, d) for reasoning …

Figure 4: Performance change ratio ((Acc_original − Acc_retrain)/Acc_original) on five benchmarks after randomly deleting 10% reasoning steps in each training trajectory. Blue/red bars represent experiments with DeepSeek-R1/gpt-oss-120b-generated data, respectively.

Figure 5: The prompt template for annotating reasoning steps with four behavior labels.

Figure 6: Token-level SFT loss analysis for the Qwen3-8B model. (a) and (d) show the token …

Figure 7: Token-level SFT loss analysis for the Qwen2.5-7B model. (a) and (d) show the …

Figure 8: Token-level SFT loss analysis for the Llama3.1-8B model. (a) and (d) show the …

Figure 9: Reasoning behavior analysis. Reasoning behavior distribution ((a) and (c)) and …

Figure 10: Performance change ratio on five benchmarks (MATH500, AIME24/25, Be…

Figure 11: Case study: a highly exploratory snippet of a reasoning trajectory generated by …

Figure 12: Case study: a highly exploratory snippet of a reasoning trajectory generated by …
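The behavior distributions and transition matrices mentioned for Figure 3 can be computed from per-step labels along the following lines. This is a sketch under assumptions: the four label names are illustrative, each trajectory is a list of labels in step order, and rows of the matrix are normalized to sum to 1 (a row with no outgoing transitions stays all zeros). The paper's exact procedure is not given on this page.

```python
from typing import Dict, List

# Assumed label set for the four annotated behaviors.
LABELS = ["propose", "deduce", "verify", "backtrack"]


def behavior_distribution(sequences: List[List[str]]) -> Dict[str, float]:
    """Share of each label over all steps of all trajectories."""
    flat = [label for seq in sequences for label in seq]
    if not flat:
        return {name: 0.0 for name in LABELS}
    return {name: flat.count(name) / len(flat) for name in LABELS}


def transition_matrix(sequences: List[List[str]]) -> List[List[float]]:
    """Row-normalized counts of label -> next-label transitions."""
    idx = {name: i for i, name in enumerate(LABELS)}
    counts = [[0] * len(LABELS) for _ in LABELS]
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[idx[prev]][idx[cur]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Two toy trajectories (hand-written for illustration).
seqs = [
    ["propose", "deduce", "deduce", "verify"],
    ["propose", "backtrack", "propose", "deduce"],
]
dist = behavior_distribution(seqs)
matrix = transition_matrix(seqs)
print(dist["propose"])  # 0.375
print(matrix[1])        # deduce row: [0.0, 0.5, 0.5, 0.0]
```

A high propose-to-propose or backtrack-to-propose mass would be one signature of the branch-heavy pattern, while a dominant deduce-to-deduce entry would indicate a convergent chain.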
Original abstract

Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that SFT on long CoT trajectories from DeepSeek-R1-0528 yields lower training loss but significantly worse generalization on reasoning benchmarks than trajectories from gpt-oss-120b (same problems), due to divergent branch-heavy exploration versus convergent deductive patterns. Filtering frequently branching trajectories from the DeepSeek data is proposed as a remedy, yielding gains of up to 5.1% on AIME25 and 3.6% average across five benchmarks.

Significance. If the causal attribution to reasoning patterns holds after controlling for confounds, the work is significant for highlighting how specific trajectory behaviors in long-CoT data affect generalization in reasoning models. The empirical comparison and simple filtering method could inform data curation practices for SFT of large reasoning models.

major comments (2)
  1. [Results section / filtering experiment] The filtering experiment (described in the abstract and results) does not control for trajectory length or step count. Branching frequency is likely correlated with longer exploratory traces; without reporting length statistics for the full vs. filtered sets or an ablation that holds length fixed while varying only branching, the reported gains (5.1% AIME25, 3.6% average) cannot be isolated to removal of branching behavior.
  2. [Experimental setup and abstract] The central paradox claim lacks details on statistical significance, exact benchmark splits, and controls for trajectory correctness rate or length distribution (abstract and experimental setup). These omissions leave the generalization discrepancy only partially supported and weaken attribution to reasoning patterns versus other uncontrolled differences.
minor comments (2)
  1. [Abstract] Clarify the exact five benchmarks used for the 3.6% average improvement and list them explicitly.
  2. [Analysis section] Provide quantitative definitions or metrics for 'frequently branching' and 'convergent deductive' trajectories to improve reproducibility of the pattern analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate clarifications and additional analyses in the revised manuscript to better support our claims.

  1. Referee: [Results section / filtering experiment] The filtering experiment (described in the abstract and results) does not control for trajectory length or step count. Branching frequency is likely correlated with longer exploratory traces; without reporting length statistics for the full vs. filtered sets or an ablation that holds length fixed while varying only branching, the reported gains (5.1% AIME25, 3.6% average) cannot be isolated to removal of branching behavior.

    Authors: We thank the referee for highlighting this potential confound. Branching frequency can indeed correlate with trajectory length. In the revised manuscript, we will report average lengths and step counts for the full DeepSeek-R1-0528 set versus the filtered subset. We will also add an ablation that subsamples trajectories to match length distributions while varying branching frequency, allowing us to better isolate the contribution of divergent patterns to the observed generalization gains. revision: yes

  2. Referee: [Experimental setup and abstract] The central paradox claim lacks details on statistical significance, exact benchmark splits, and controls for trajectory correctness rate or length distribution (abstract and experimental setup). These omissions leave the generalization discrepancy only partially supported and weaken attribution to reasoning patterns versus other uncontrolled differences.

    Authors: We agree that these details strengthen the presentation. All trajectories from both sources are verified correct, yielding a 100% correctness rate. The benchmarks employ standard community splits (official AIME25 test set and equivalent for the remaining four). In revision, we will explicitly document the splits and correctness control in the experimental setup, report length distributions, and add statistical significance measures (e.g., bootstrap confidence intervals or paired tests) for the performance differences. These additions will more firmly attribute the discrepancy to reasoning patterns. revision: yes
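The bootstrap confidence interval mentioned in the second response could look like the sketch below. It assumes per-problem 0/1 correctness for two models evaluated on the same benchmark problems and resamples problems with replacement, keeping each pair together. The function name, the resample count, and the toy scores are illustrative assumptions, not the authors' procedure or results.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy difference A - B, CI lower, CI upper)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample problem indices so each pair (a_i, b_i) stays together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper


# Toy scores on 30 problems (made up for illustration).
scores_a = [1] * 18 + [0] * 12  # 60% accuracy
scores_b = [1] * 15 + [0] * 15  # 50% accuracy
observed, lower, upper = paired_bootstrap_ci(scores_a, scores_b)
print(round(observed, 3))  # 0.1
```

If the interval includes zero, the benchmark alone does not separate the two models at the chosen level. With small sets such as a 30-problem contest, intervals tend to be wide, which is one reason the referee's request matters.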

Circularity Check

0 steps flagged

No significant circularity in empirical comparison


The paper is a purely empirical study comparing SFT on verified CoT trajectories from two models (DeepSeek-R1-0528 and gpt-oss-120b) on identical problem sets. It reports lower training loss yet worse generalization for one source, attributes this to observed differences in reasoning patterns (divergent branching vs convergent deductive), and shows post-hoc filtering of branching trajectories yields gains on benchmarks. No equations, derivations, fitted parameters, or self-citations reduce any claim to its inputs by construction. The filtering rule is an empirical remedy, not a prediction derived from the same data. The central claims rest on direct experimental measurements and are self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that branching frequency correlates with generalization failure. No free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5664 in / 1161 out tokens · 40705 ms · 2026-05-13T22:04:59.208307+00:00 · methodology

