Title resolution pending

Distilling the Knowledge in a Neural Network · 2015

12 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background: 2

citation-polarity summary

background: 1 · unclear: 1

representative citing papers

Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

CAML meta-learns a progressively refined inductive bias from active-learning queries to improve robustness to spurious correlations, reporting accuracy gains on minority groups across several benchmarks.

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

SRA: Span Representation Alignment for Large Language Model Distillation

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

SRA reframes CTKD by aligning attention-weighted span centers of mass in a multi-particle system model with geometric regularization and span logit distillation, claiming consistent outperformance over baselines.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

MixSD uses dynamic mixing of the model's expert and naive conditionals to create distribution-aligned supervision that improves the memorization-retention tradeoff over standard SFT.

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

SHRED performs retain-set-free unlearning by selecting lowest-probability tokens as forget positions and applying a single KL self-distillation objective that demotes logits only at those positions.

Parallel Prefix Verification for Speculative Generation

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

cs.CL · 2026-05-02 · unverdicted · novelty 5.0

MTA is a distillation method that aligns teacher-student LLM representations along their transformation trajectories using layer-adaptive granularities and dynamic structural plus hidden representation alignment losses.

citing papers explorer

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · none · ref 42

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
