Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning

Dengfeng Liu; Huang Liu; Huiyang Li; Jin Yang; Ji Xie; Jucai Zhai; Jun Wang; Kailai Zhang; Ninglun Gu; Qiaobo Hao

arxiv: 2508.09883 · v2 · pith:THZM232Dnew · submitted 2025-08-13 · 💻 cs.LG · cs.AI

Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang

keywords reasoning · distillation · model · scaling · capabilities · learning · methods · teacher

Abstract

Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpora and multistage training that combines reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning through distillation alone, a reasoning scaling law is still taking shape, increasing computational costs. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
cs.CL 2026-01 conditional novelty 6.0

Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework
cs.CL 2026-01 unverdicted novelty 6.0

COMPACT adaptively fuses multi-teacher CoT supervisions using graph-based consensus, mutual-information adaptability, and loss-based difficulty metrics to improve small language model reasoning performance while mitig...

Training-Trajectory-Aware Token Selection
cs.CL 2026-01 unverdicted novelty 6.0

Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
cs.CL 2026-05 unverdicted novelty 5.0

MOTAB is a new distillation pipeline that monitors on-policy student trajectories and backtracks with teacher intervention to mitigate dual exposure biases, improving reasoning performance by about 3%.