arXiv: 2402.03898 · v2 · pith:WNZ4XCELnew · submitted 2024-02-06 · 💻 cs.CL · cs.AI · cs.LG

DistiLLM: Towards Streamlined Distillation for Large Language Models

keywords models, distillm, language, model, auto-regressive, distillation, large, methods

Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatch has significantly increased computational cost. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to use student-generated outputs more efficiently. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to a 4.3$\times$ speedup over recent KD methods.
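
The abstract names a skew Kullback-Leibler divergence loss but does not define it. As a rough illustration only, the sketch below assumes the common α-skew form, KL(p ‖ αp + (1−α)q), in which the student distribution q is mixed with a small share of the teacher distribution p before the divergence is taken. The function names, the value of `alpha`, and the toy distributions are all hypothetical and are not taken from the paper or its code.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def skew_kl(p, q, alpha=0.1):
    """Assumed alpha-skew KL: KL(p || alpha * p + (1 - alpha) * q).

    alpha = 0 recovers the plain KL divergence; alpha = 0.1 is an
    illustrative choice, not a value confirmed by the source.
    """
    mixture = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)

# Toy next-token distributions over a four-token vocabulary (made up).
teacher = [0.7, 0.2, 0.1, 0.0]
student = [0.0, 0.1, 0.2, 0.7]

print("plain KL:", kl_divergence(teacher, student))
print("skew KL :", skew_kl(teacher, student, alpha=0.1))
```

Under this assumed form, the mixture never falls below α·p wherever the teacher has mass, so the skewed divergence stays finite and is at most log(1/α), even when the plain KL divergence is infinite as in the toy example.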



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

    cs.CL 2026-06 unverdicted novelty 7.0

    ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...

  2. Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

  3. Visual-Advantage On-Policy Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.

  4. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  5. MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

    cs.CL 2026-05 unverdicted novelty 7.0

    MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.

  6. PHF: Privileged Hidden Flow for On-Policy Self-Distillation

    cs.AI 2026-06 unverdicted novelty 6.0

    PHF distills token-to-token transition directions and trajectory geometry in hidden states during on-policy self-distillation, reporting 1.5-2.2 point gains on Average@12 for Qwen3-1.7B/4B/8B over reproduced OPSD base...

  7. RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

    cs.LG 2026-06 unverdicted novelty 6.0

    RLCSD contrasts teacher-student distributional gaps under correct versus wrong hints to suppress privilege-induced style drift and concentrate supervision on task tokens, outperforming GRPO and prior OPSD on Qwen3 and...

  8. Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex s...

  9. When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME...

  10. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  11. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  12. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  13. PriFT: Prior-Support Guided Supervised Fine-Tuning

    cs.CL 2026-06 unverdicted novelty 5.0

    PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.

  14. MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

    cs.CV 2026-05 unverdicted novelty 5.0

    MoASE++ combines activation sparsity experts with domain-adaptive on-policy distillation to achieve state-of-the-art continual test-time adaptation on image classification and segmentation benchmarks.

  15. Curriculum Learning-Guided Progressive Distillation in Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    CLPD improves LLM distillation for reasoning by combining explicit data curriculum with progressive teacher scheduling of increasing capacity.

  16. Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    cs.LG 2026-05 unverdicted novelty 5.0

    NPD makes on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to reach a state-of-the-art score of 68.73%.

  17. MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

    cs.CL 2026-05 unverdicted novelty 5.0

    MTA is a distillation method that aligns teacher-student LLM representations along their transformation trajectories using layer-adaptive granularities and dynamic structural plus hidden representation alignment losses.