Scale or Reason? A Compute-Equivalent Analysis of Reasoning Distillation

Céline Hudelot; Hippolyte Gisserot-Boukhlef; Kevin El Haddad; Nicolas Boizard; Pierre Colombo

arxiv: 2509.22193 · v2 · pith:GSQZVRQPnew · submitted 2025-09-26 · 💻 cs.CL

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El Haddad, Céline Hudelot, Pierre Colombo

keywords: reasoning, compute, distillation, frontier, models, only, outputs, pareto

Distilling reasoning traces from strong teacher models has become the standard recipe for building capable small language models. Yet reasoning traces are 5-20× longer than standard instruction fine-tuning (IFT) outputs, meaning every practitioner who chooses reasoning distillation implicitly forgoes training a larger IFT model on the same compute budget. Whether this trade-off is worthwhile remains unaddressed. We study it with a controlled experiment: a single teacher generates paired IFT and reasoning outputs for identical prompts by toggling only its reasoning mode, isolating supervision format as the sole variable. Training students at five scales (0.5B to 14B) and evaluating on 18 benchmarks, we find that at matched FLOPs, IFT lies on or near the Pareto frontier across the majority of configurations. Reasoning reaches the Pareto frontier only on open-ended tasks at 7B and above. Even there, a sequential curriculum mixing just 25-50% reasoning data with IFT captures most of the accuracy benefit at far lower compute cost.
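The matched-FLOPs trade-off described in the abstract can be illustrated with a rough back-of-the-envelope sketch. It assumes the common approximation that training cost is about 6 × parameters × tokens, that both formats use the same prompts, and that cost is dominated by output tokens; none of these assumptions, nor the function names or the example token count, come from the paper, whose own FLOP accounting may differ.

```python
import math

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs with the common 6 * N * D rule of thumb.

    Assumption: this rule of thumb is NOT taken from the paper.
    """
    return 6.0 * n_params * n_tokens

def matched_flops_ift_params(reasoning_params: float,
                             ift_tokens: float,
                             length_ratio: float) -> float:
    """Size of the IFT student that costs the same to train as a reasoning
    student, when reasoning outputs are `length_ratio` times longer.

    Hypothetical helper: ignores prompt tokens and teacher generation cost.
    """
    reasoning_flops = training_flops(reasoning_params, ift_tokens * length_ratio)
    # Solve 6 * N_ift * ift_tokens = reasoning_flops for N_ift.
    return reasoning_flops / (6.0 * ift_tokens)

if __name__ == "__main__":
    ift_tokens = 1e8  # illustrative token count, not a figure from the paper
    for ratio in (5, 20):  # the 5-20x length range quoted in the abstract
        n_ift = matched_flops_ift_params(0.5e9, ift_tokens, ratio)
        print(f"0.5B reasoning student at {ratio}x length "
              f"~ {n_ift / 1e9:.1f}B IFT student at equal training FLOPs")
```

Under these simplifications the token count cancels out, so the compute-equivalent IFT model is simply the reasoning student's size multiplied by the length ratio. This is the sense in which choosing reasoning distillation forgoes a larger IFT model.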

Forward citations

Cited by 1 Pith paper

Multi-Block Diffusion Language Models
cs.LG 2026-06 unverdicted novelty 6.0

MBD-LMs post-train BD-LMs using MultiTF on bounded noise-groups with randomized schedulers and Block Buffer decoding to increase average TPF from 3.47 to 6.19 with accuracy rising to 81.03%.