CAML meta-learns a progressively refined inductive bias from active-learning queries to improve robustness to spurious correlations, reporting accuracy gains on minority groups across several benchmarks.
Title resolution pending
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2representative citing papers
Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.
SHRED achieves retain-set-free LLM unlearning by selecting high-Shannon-information tokens for logit demotion in a single self-distillation KL objective, yielding a superior forget-utility Pareto front on four benchmarks.
PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
SRA reframes cross-tokenizer LLM distillation as alignment of attention-weighted span centers of mass in a multi-particle dynamical system and reports consistent gains over prior CTKD baselines.
citing papers explorer
-
Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations
CAML meta-learns a progressively refined inductive bias from active-learning queries to improve robustness to spurious correlations, reporting accuracy gains on minority groups across several benchmarks.
-
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.
-
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
-
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
-
Self-Supervised On-Policy Distillation for Reasoning Language Models
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.
-
SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion
SHRED achieves retain-set-free LLM unlearning by selecting high-Shannon-information tokens for logit demotion in a single self-distillation KL objective, yielding a superior forget-utility Pareto front on four benchmarks.
-
Parallel Prefix Verification for Speculative Generation
PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
-
SRA: Span Representation Alignment for Large Language Model Distillation
SRA reframes cross-tokenizer LLM distillation as alignment of attention-weighted span centers of mass in a multi-particle dynamical system and reports consistent gains over prior CTKD baselines.