arxiv: 2604.08880 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI · cs.CL

Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

Tokio Kajitsuka, Ukyo Honda, Sho Takase

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords chain-of-thought distillation · capacity gap · knowledge distillation · student baseline · reasoning performance · evaluation protocol · teacher selection

The pith

CoT distillation often degrades student performance below its own pre-distillation baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper re-examines claims about a capacity gap that supposedly blocks chain-of-thought distillation when teachers and students differ greatly in ability. It finds that in standard experimental setups the distillation step itself frequently lowers the student's reasoning performance compared with the level the student already reached before any distillation occurred. Earlier studies missed this pattern because they reported only post-distillation results and skipped the baseline comparison. When a fuller protocol is applied that keeps the baseline and tests teachers of varying quality, the capacity gap no longer appears as the decisive factor across tasks.

Core claim

Prior reports of a capacity gap in chain-of-thought distillation rest on comparisons that omit the student's pre-distillation performance. In the settings examined, distillation commonly produces results worse than that baseline, and the size of the teacher-student capability difference does not consistently dominate outcomes once teachers of substantially different performance levels are considered.

What carries the argument

A realistic evaluation protocol that measures performance against the student's pre-distillation baseline and varies the performance level of candidate teachers.
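
A minimal sketch of how such a protocol might be tabulated, assuming per-task accuracies are already available. The function names, task names, teacher labels, and all numbers below are hypothetical placeholders for illustration and are not results from the paper.

```python
from typing import Dict, List


def compare_to_baseline(
    baseline: Dict[str, float],
    distilled: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Per task, return each distilled student's accuracy minus the baseline."""
    deltas: Dict[str, Dict[str, float]] = {}
    for task, base_acc in baseline.items():
        deltas[task] = {
            teacher: scores[task] - base_acc
            for teacher, scores in distilled.items()
        }
    return deltas


def degraded_tasks(deltas: Dict[str, Dict[str, float]]) -> List[str]:
    """Tasks where at least one distilled student falls below the baseline."""
    return [task for task, d in deltas.items() if any(v < 0 for v in d.values())]


# Hypothetical accuracies (illustrative only, not taken from the paper).
baseline = {"task_a": 0.60, "task_b": 0.45}
distilled = {
    "small_teacher": {"task_a": 0.55, "task_b": 0.50},
    "large_teacher": {"task_a": 0.62, "task_b": 0.52},
}

deltas = compare_to_baseline(baseline, distilled)
flagged = degraded_tasks(deltas)
print(flagged)
```

Reporting the signed difference per teacher, rather than only the best post-distillation score, keeps a regression against the starting point visible even when one teacher still beats another.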

If this is right

Any assessment of distillation must include the student's performance before distillation as a reference point.
Teacher selection should be guided by the teacher's absolute performance rather than by an assumed match in capability.
Capacity-gap concerns may be secondary to other factors on many tasks once baselines are taken into account.
Practical use requires verifying that distillation improves upon the student's existing capabilities rather than assuming it will.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Distillation procedures may need targeted changes to avoid harming the student's starting performance.
Similar baseline checks could be applied to other forms of knowledge transfer or model compression.
Task-by-task variation indicates that no single rule for choosing teachers will hold without extensive testing.

Load-bearing premise

The commonly used experimental settings re-examined here are representative of how chain-of-thought distillation is applied in practice.

What would settle it

A broad set of tasks and teacher-student pairs in which distillation produces consistent gains over the pre-distillation baseline even when capability differences are large.

Figures

Figures reproduced from arXiv: 2604.08880 by Sho Takase, Tokio Kajitsuka, Ukyo Honda.

**Figure 2.** Results under the small–large setting on 15 selected BBH tasks with our practical evaluation protocol. We compare the pre-distillation baseline with students distilled from Qwen2.5-14B-Instruct (Small Teacher) and Qwen2.5-72B-Instruct (Large Teacher). Teacher Ceiling indicates the few-shot performance of each teacher model. Tasks with a gray background indicate cases where at least one distilled model unde…

**Figure 3.** Results under the short–long setting on 15 selected BBH tasks with our practical evaluation protocol. We compare the pre-distillation baseline with students distilled from Qwen2.5-32B-Instruct (Short Teacher) and QwQ-32B-Preview (Long Teacher). Teacher Ceiling indicates the few-shot performance of each teacher model. Tasks with a gray background indicate cases where at least one distilled model underperfor…

**Figure 4.** Effect of effective batch size alignment on …

**Figure 5.** Results on 5 selected BBH tasks with Gemma-2 teachers and Qwen2.5 students. We compare the pre-distillation baseline with students distilled from Gemma-2-9B-it (Small Teacher) and Gemma-2-27B-it (Large Teacher). Teacher Ceiling indicates the few-shot performance of each teacher model. Tasks with a red background indicate cases where the results did not follow the capacity gap hypothesis.

Original abstract

Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows CoT distillation often drops below the student's pre-distillation baseline in re-examined setups, but lacks controls to confirm this stems from capacity gap rather than plain fine-tuning effects.

The main point here is that prior CoT distillation papers missed a basic comparison: the student model before any distillation at all. In the common settings they revisit, distillation frequently makes things worse than that starting point, which hides the real cost when you only look at post-distillation numbers. They also report that capacity gap effects do not show up consistently once teachers vary a lot in strength, and they suggest a protocol that includes the pre-baseline for more realistic checks.

Referee Report

2 major / 2 minor

Summary. The paper re-examines commonly used experimental settings for chain-of-thought (CoT) distillation from a practical perspective. It claims that CoT distillation frequently degrades student model performance relative to the pre-distillation baseline—an effect masked by post-distillation-only comparisons in prior work. The authors propose a more realistic evaluation protocol and conclude that capacity-gap effects do not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in capability, and offer guidance on teacher-student pair selection.

Significance. If the central empirical findings hold after addressing controls, the work would be significant for practical CoT distillation by demonstrating that larger teachers are not always preferable and that baseline comparisons are essential. It provides actionable guidance on model selection and evaluation protocols in a widely used technique.

major comments (2)

[Experimental re-examination and evaluation protocol] The central claim that observed performance degradation versus the pre-distillation baseline is attributable to the CoT capacity gap (rather than generic fine-tuning effects) is load-bearing for the practical recommendations. The re-examination of common settings must include explicit controls such as matching training steps, data volume, and optimization schedule to a non-CoT fine-tuning baseline on identical student data; without these, degradation could arise from distribution shift or catastrophic forgetting instead.
[Results and analysis sections] The manuscript should report statistical significance, number of random seeds, and variance across runs for all degradation claims versus baselines, as single-run or uncontrolled results undermine the assertion that capacity-gap effects 'do not consistently dominate' across tasks.
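
One way the matched-control requirement in the first major comment could be checked mechanically is sketched below. It assumes each run is described by a small configuration record. The `TrainConfig` fields, the run names, and the values are hypothetical and do not reflect the paper's actual settings.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class TrainConfig:
    # All fields and values are illustrative assumptions, not the paper's settings.
    train_steps: int
    num_examples: int
    learning_rate: float
    lr_schedule: str
    effective_batch_size: int


def mismatched_fields(a: TrainConfig, b: TrainConfig) -> List[str]:
    """Names of the hyperparameters that differ between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# A CoT-distillation run and a hypothetical non-CoT fine-tuning control.
cot_run = TrainConfig(1000, 5000, 1e-5, "cosine", 32)
control_run = TrainConfig(1000, 5000, 1e-5, "cosine", 64)

diff = mismatched_fields(cot_run, control_run)
print(diff)
```

An empty list would indicate that the two runs differ only in their training targets, which is the condition under which a performance gap can be attributed to the CoT data rather than to the optimization setup.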

minor comments (2)

Clarify the exact differences between the proposed realistic evaluation protocol and prior protocols, including any new metrics or comparison baselines introduced.
Ensure tables reporting performance include both pre- and post-distillation numbers for all student models to make the degradation effect immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful feedback on our manuscript. The suggestions for improved experimental controls and statistical reporting are valuable, and we have incorporated revisions to address them. Below, we provide point-by-point responses to the major comments.

Referee: [Experimental re-examination and evaluation protocol] The central claim that observed performance degradation versus the pre-distillation baseline is attributable to the CoT capacity gap (rather than generic fine-tuning effects) is load-bearing for the practical recommendations. The re-examination of common settings must include explicit controls such as matching training steps, data volume, and optimization schedule to a non-CoT fine-tuning baseline on identical student data; without these, degradation could arise from distribution shift or catastrophic forgetting instead.

Authors: We appreciate this point and agree that distinguishing CoT-specific effects from generic fine-tuning is important for rigorous attribution. Our original experiments focused on the practical scenario where practitioners compare CoT-distilled models to their pre-distillation baselines, which is the relevant baseline in many applications. However, to address the concern, we have added a new set of control experiments in the revised manuscript. These include non-CoT fine-tuning on the same student data with matched training steps, data volume, and optimization schedules. The results show that while some degradation occurs due to fine-tuning in general, the capacity gap effects in CoT distillation are more significant when teacher-student mismatches are large, consistent with our claims. We have updated the analysis sections accordingly.

Revision: yes

Referee: [Results and analysis sections] The manuscript should report statistical significance, number of random seeds, and variance across runs for all degradation claims versus baselines, as single-run or uncontrolled results undermine the assertion that capacity-gap effects 'do not consistently dominate' across tasks.

Authors: We thank the referee for emphasizing the need for statistical rigor. In the revised manuscript, we have expanded the results to include experiments run with 5 random seeds for the key settings. We report means, standard deviations, and p-values from statistical tests (using paired t-tests where appropriate) for comparisons against baselines. The variance is generally low, and the findings that capacity gap effects do not consistently dominate remain supported, with statistical significance for the observed patterns across tasks.

Revision: yes
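
The seed-level reporting described in this response can be illustrated with a short sketch using only the Python standard library. It assumes the two conditions are paired by seed, which is an assumption of this example and not something the rebuttal specifies. The function name and the accuracy values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    a: Sequence[float], b: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (mean difference, std of differences, t statistic) for paired runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return mean_d, sd_d, t


# Hypothetical per-seed accuracies for 5 seeds (illustrative only).
distilled_acc = [0.50, 0.52, 0.49, 0.51, 0.53]
baseline_acc = [0.55, 0.56, 0.55, 0.54, 0.57]

mean_d, sd_d, t = paired_t_statistic(distilled_acc, baseline_acc)
print(f"mean diff={mean_d:.3f}, sd={sd_d:.3f}, t={t:.2f}, df={len(distilled_acc) - 1}")
```

Converting the t statistic to a p-value requires the t distribution with n - 1 degrees of freedom. That step is omitted here to keep the sketch dependency-free. A statistics library such as SciPy (`scipy.stats.ttest_rel`) performs the full test.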

Circularity Check

0 steps flagged

No circularity: purely empirical re-examination with no derivations or self-referential claims

The paper conducts an empirical study by re-running CoT distillation experiments under commonly used settings and comparing against pre-distillation baselines. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the central claims. All reported findings are direct experimental outcomes presented as observations rather than quantities defined in terms of the result itself, rendering the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical derivations; no free parameters, axioms, or invented entities are introduced or relied upon beyond standard machine learning assumptions about experimental controls.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Curriculum Learning-Guided Progressive Distillation in Large Language Models
cs.LG · 2026-05 · unverdicted · novelty 5.0

CLPD improves LLM distillation for reasoning by combining explicit data curriculum with progressive teacher scheduling of increasing capacity.
