hub

GKD: Generalized knowledge distillation for auto-regressive se- quence models.arXiv preprint arXiv:2306.13649

URL https: //arxiv · 2024 · arXiv 2306.13649

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

TIP: Token Importance in On-Policy Distillation

cs.LG · 2026-04-15 · unverdicted · novelty 6.0 · 3 refs

A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

cs.AI · 2025-09-15 · unverdicted · novelty 6.0

A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

cs.CL · 2023-12-10 · unverdicted · novelty 6.0

ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

MiniLLM: On-Policy Distillation of Large Language Models

cs.CL · 2023-06-14 · conditional · novelty 6.0

MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

A Brief Overview: On-Policy Self-Distillation In Large Language Models

cs.HC · 2026-05-18 · unverdicted · novelty 2.0 · 2 refs

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

citing papers explorer

Showing 13 of 13 citing papers.

KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG · 2026-05-08 · unverdicted · none · ref 1
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors cs.CL · 2026-04-23 · unverdicted · none · ref 1
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning cs.LG · 2026-05-20 · unverdicted · none · ref 2
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation cs.LG · 2026-05-12 · unverdicted · none · ref 1
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.
TIP: Token Importance in On-Policy Distillation cs.LG · 2026-04-15 · unverdicted · none · ref 1 · 3 links
A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction cs.AI · 2025-09-15 · unverdicted · none · ref 1
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 174
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models cs.CL · 2023-12-10 · unverdicted · none · ref 1
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
MiniLLM: On-Policy Distillation of Large Language Models cs.CL · 2023-06-14 · conditional · none · ref 2
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 1
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing cs.LG · 2026-05-07 · unverdicted · none · ref 2
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 132
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
A Brief Overview: On-Policy Self-Distillation In Large Language Models cs.HC · 2026-05-18 · unverdicted · none · ref 1 · 2 links
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

GKD: Generalized knowledge distillation for auto-regressive se- quence models.arXiv preprint arXiv:2306.13649

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer