Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem · 2024 · arXiv 2306.13649

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.

PHF: Privileged Hidden Flow for On-Policy Self-Distillation

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

PHF distills token-to-token transition directions and trajectory geometry in hidden states during on-policy self-distillation, reporting gains of 1.5-2.2 points on Average@12 for Qwen3-1.7B/4B/8B over a reproduced OPSD baseline under a 100-step schedule.

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

RAFT improves domain accuracy by 23.2% over standard SFT while recovering 18.2% and 10.2% relative performance on MS-Bench and IFEval through refined supervision and trajectory-preserving distillation.

Adversarial Dual On-Policy Distillation from Expressive Teacher

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

FA-OPD co-trains a flow-matching teacher and MLP student via adversarial dual on-policy distillation, improving robustness over baselines on six robot benchmarks with noisy or limited demonstrations.

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept interference.

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

TIP: Token Importance in On-Policy Distillation

cs.LG · 2026-04-15 · unverdicted · novelty 6.0 · 3 refs

A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

cs.AI · 2025-09-15 · unverdicted · novelty 6.0

A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

cs.CL · 2023-12-10 · unverdicted · novelty 6.0

ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

MiniLLM: On-Policy Distillation of Large Language Models

cs.CL · 2023-06-14 · conditional · novelty 6.0

MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

SGCD improves held-out scores on AppWorld and tau^3-airline by using LLM-summarized sibling contrasts to reshape GRPO advantages while keeping policy gradient in charge of the actor update.

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

DenseSteer is an inference-time steering framework that improves small LLMs' accuracy on math reasoning by modulating representations toward dense reasoning patterns with fewer but higher-density steps.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

NPD makes on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to reach a state-of-the-art score of 68.73%.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

A Brief Overview: On-Policy Self-Distillation In Large Language Models

cs.HC · 2026-05-18 · unverdicted · novelty 2.0 · 2 refs

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

citing papers explorer

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents
cs.AI · 2026-06-29 · unverdicted · none · ref 3

KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG · 2026-05-08 · unverdicted · none · ref 1

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
cs.CL · 2026-04-23 · unverdicted · none · ref 1

PHF: Privileged Hidden Flow for On-Policy Self-Distillation
cs.AI · 2026-06-28 · unverdicted · none · ref 1

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting
cs.LG · 2026-05-29 · unverdicted · none · ref 11

Adversarial Dual On-Policy Distillation from Expressive Teacher
cs.LG · 2026-05-26 · unverdicted · none · ref 1

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation
cs.CV · 2026-05-25 · unverdicted · none · ref 1

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
cs.LG · 2026-05-20 · unverdicted · none · ref 2

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG · 2026-05-12 · unverdicted · none · ref 1

TIP: Token Importance in On-Policy Distillation
cs.LG · 2026-04-15 · unverdicted · none · ref 1 · 3 links

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
cs.AI · 2025-09-15 · unverdicted · none · ref 1

Training Language Models to Self-Correct via Reinforcement Learning
cs.LG · 2024-09-19 · unverdicted · none · ref 174

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
cs.CL · 2023-12-10 · unverdicted · none · ref 1

MiniLLM: On-Policy Distillation of Large Language Models
cs.CL · 2023-06-14 · conditional · none · ref 2

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
cs.LG · 2026-06-10 · unverdicted · none · ref 31

DenseSteer: Steering Small Language Models towards Dense Math Reasoning
cs.AI · 2026-05-28 · unverdicted · none · ref 2

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL · 2026-05-12 · unverdicted · none · ref 1

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
cs.LG · 2026-05-07 · unverdicted · none · ref 2

A Survey on Efficient Inference for Large Language Models
cs.CL · 2024-04-22 · accept · none · ref 132

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC · 2026-05-18 · unverdicted · none · ref 1 · 2 links
