Title resolution pending

Distilling the Knowledge in a Neural Network · 2015

12 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 1 · unclear 1

representative citing papers

Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

CAML meta-learns a progressively refined inductive bias from active-learning queries to improve robustness to spurious correlations, reporting accuracy gains on minority groups across several benchmarks.

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

SHRED achieves retain-set-free LLM unlearning by selecting high-Shannon-information tokens for logit demotion in a single self-distillation KL objective, yielding a superior forget-utility Pareto front on four benchmarks.

Parallel Prefix Verification for Speculative Generation

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

SRA: Span Representation Alignment for Large Language Model Distillation

cs.CL · 2026-05-02 · unverdicted · novelty 5.0

SRA reframes cross-tokenizer LLM distillation as alignment of attention-weighted span centers of mass in a multi-particle dynamical system and reports consistent gains over prior CTKD baselines.

citing papers explorer

Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations · cs.LG · 2026-05-20 · unverdicted · none · ref 94
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation · cs.LG · 2026-05-08 · unverdicted · none · ref 11
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost · cs.AI · 2026-05-07 · conditional · none · ref 45
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation · cs.CL · 2026-05-02 · unverdicted · none · ref 1
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation · cs.CL · 2025-02-28 · unverdicted · none · ref 42
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization · cs.LG · 2026-05-19 · unverdicted · none · ref 34
Self-Supervised On-Policy Distillation for Reasoning Language Models · cs.LG · 2026-05-17 · unverdicted · none · ref 19
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection · cs.CL · 2026-05-16 · unverdicted · none · ref 20 · 2 links
SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion · cs.LG · 2026-05-08 · unverdicted · none · ref 8
Parallel Prefix Verification for Speculative Generation · cs.AI · 2026-05-05 · unverdicted · none · ref 29
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation · cs.CL · 2026-05-12 · unverdicted · none · ref 53 · 2 links
SRA: Span Representation Alignment for Large Language Model Distillation · cs.CL · 2026-05-02 · unverdicted · none · ref 1
