
arxiv: 2606.06021 · v4 · pith:44ZSHURFnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI

OPRD: On-Policy Representation Distillation

Pith reviewed 2026-06-28 02:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · representation distillation · knowledge distillation · language models · mathematical reasoning · cross-architecture transfer · hidden state alignment

The pith

Aligning hidden states on on-policy rollouts lets student models close the performance gap to teachers on math benchmarks where output-space methods plateau.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy distillation can be moved from output token distributions into hidden-state space by aligning representations across selected layers on the same rollouts the student generates. This supplies dense deterministic supervision that removes token-level gradient variance and bypasses the language-model head bottleneck. A sympathetic reader would care because it produces students that come closer to teacher accuracy on AIME 2024, AIME 2025, and AIMO while training 1.44 times faster and using up to 54 percent less memory. The method further extends to cross-architecture and cross-tokenizer cases by using a frozen low-rank projector pair, which moves the alignment out of the output space (where it depends on a shared vocabulary) and into representation space.

Core claim

OPRD aligns student and teacher hidden states across selected layers on identical on-policy rollouts, delivering dense deterministic per-layer supervision that removes the high-variance gradient estimator and the LM-head information bottleneck of output-space distillation; the same approach with a frozen projector pair extends the alignment to heterogeneous models that differ in depth, width or tokenizer.
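As a rough illustration of the alignment objective described above, the sketch below averages a cosine distance between student and teacher hidden states over token positions and selected layers of one shared rollout. The cosine-distance form, the function names, the layer indices, and the toy vectors are all assumptions made for illustration; this summary does not state the paper's exact loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(student_states, teacher_states, layers):
    """Mean (1 - cosine) over the selected layers and all token positions.

    student_states / teacher_states: dict mapping layer index -> list of
    per-token hidden vectors, both computed on the SAME student rollout.
    """
    total, count = 0.0, 0
    for layer in layers:
        for h_s, h_t in zip(student_states[layer], teacher_states[layer]):
            total += 1.0 - cosine(h_s, h_t)
            count += 1
    return total / count

# Toy rollout: 2 tokens, hidden size 3, layers 4 and 8 (hypothetical values).
student = {4: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
           8: [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]}
teacher = {4: [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
           8: [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]}

loss = alignment_loss(student, teacher, layers=[4, 8])
print(f"alignment loss: {loss:.4f}")
```

Because the loss is a plain function of the hidden states, evaluating it twice on the same rollout gives the same value; no tokens are sampled inside the objective itself.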

What carries the argument

On-policy hidden-state alignment at selected layers, using a frozen projector pair that exploits low-rank representational structure to bridge arbitrary model mismatches.
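The Figure 3 caption on this page describes the projector pair as built in two stages: P_T from a PCA of teacher hidden states, then P_S trained to match it in the shared r-dimensional space. The sketch below mimics Stage 1 on synthetic data. The least-squares fit is a stand-in for whatever training procedure the paper uses, and the synthetic shared-latent data, the dimensions, and every function name are assumptions. It also assumes student and teacher states are available at matching positions, which glosses over the cross-tokenizer case.

```python
import numpy as np

def pca_projector(H, r):
    """Return (mean, P): P is d x r and holds the top-r principal directions of H (n x d)."""
    mean = H.mean(axis=0)
    _, _, vt = np.linalg.svd(H - mean, full_matrices=False)
    return mean, vt[:r].T

def fit_student_projector(H_s, Z_t):
    """Least-squares stand-in for training P_S: minimise ||[H_s, 1] W - Z_t||^2."""
    X = np.hstack([H_s, np.ones((H_s.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, Z_t, rcond=None)
    return W

def mean_cosine(A, B):
    """Average row-wise cosine similarity between two (n x r) arrays."""
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
n, r, d_teacher, d_student = 512, 8, 64, 32
latent = rng.normal(size=(n, r))  # assumed shared low-rank structure
H_t = latent @ rng.normal(size=(r, d_teacher)) + 0.01 * rng.normal(size=(n, d_teacher))
H_s = latent @ rng.normal(size=(r, d_student)) + 0.01 * rng.normal(size=(n, d_student))

# Stage 1: teacher projector from PCA, student projector fitted to match it.
mean_t, P_t = pca_projector(H_t, r)
Z_t = (H_t - mean_t) @ P_t
W_s = fit_student_projector(H_s, Z_t)
Z_s = np.hstack([H_s, np.ones((n, 1))]) @ W_s

# Both projectors would now be frozen; Stage 2 would update only the student backbone.
bridge_cosine = mean_cosine(Z_s, Z_t)
print(f"mean cosine in the shared subspace: {bridge_cosine:.3f}")
```

The synthetic data is low-rank by construction, so the high similarity here only shows that the mechanics work when the premise holds; it says nothing about whether real models satisfy it.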

If this is right

  • Students reach teacher-level accuracy on competition mathematics where every output-space baseline plateaus below the teacher.
  • Training runs 1.44 times faster and uses up to 54 percent less memory.
  • The method supplies a deterministic per-sample gradient instead of the high-variance token-level estimator of output-space distillation.
  • Successful transfer occurs even when models have no shared vocabulary, via representation-space alignment alone.
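The contrast between a sampled token-level signal and a deterministic one can be seen on a toy next-token distribution. The sketch below compares the exact reverse KL with a one-sample estimate that scores only the token drawn from the student. The probabilities are hypothetical, and the example tracks the variance of the loss estimate rather than of the gradient, so it is only a loose analogue of the paper's argument.

```python
import math
import random

def reverse_kl(q, p):
    """Exact KL(q || p), summed over the full vocabulary."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def sampled_token_estimate(q, p, rng):
    """One-sample estimate: draw a token from the student q, score log(q/p)."""
    token = rng.choices(range(len(q)), weights=q, k=1)[0]
    return math.log(q[token] / p[token])

student_q = [0.70, 0.20, 0.05, 0.05]  # hypothetical student next-token probs
teacher_p = [0.60, 0.25, 0.10, 0.05]  # hypothetical teacher next-token probs

rng = random.Random(0)
exact = reverse_kl(student_q, teacher_p)
draws = [sampled_token_estimate(student_q, teacher_p, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / len(draws)

print(f"exact reverse KL      : {exact:.4f}")
print(f"mean of sampled draws : {mean:.4f}")
print(f"variance of one draw  : {variance:.4f}")
```

The sampled estimate is unbiased, but a single draw fluctuates far more than the quantity being estimated when the two distributions are already close. A loss computed directly from hidden states has no such sampling step.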

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the low-rank structure observation generalizes, representation distillation could become the default channel for transferring knowledge between models that lack a common tokenizer.
  • The deterministic per-layer signal may combine more cleanly with reinforcement-learning objectives than noisy output-space distillation does.
  • The same alignment mechanism could be tested on non-reasoning tasks to check whether intermediate representations carry transferable structure beyond mathematics benchmarks.

Load-bearing premise

Student and teacher hidden states at chosen layers remain meaningfully alignable on the same on-policy rollouts even when the models differ in depth and width.

What would settle it

A controlled run on AIME 2024 in which an OPRD-trained student still trails the teacher by the same margin the paper reports for the output-space baselines.

Figures

Figures reproduced from arXiv: 2606.06021 by Bowen Song, Gang Chen, Guangcheng Zhu, Haobo Wang, Junbo Zhao, Mingxuan Xia, Shenzhi Yang, Weiqiang Wang, Xing Zheng, Yingfan Ma, Zhongqi Chen.

Figure 1: OPRD is strictly Pareto-dominant on accuracy, training time, and GPU memory. Each bubble is a method trained from the same R1-distill-1.5B student against the JustRL-1.5B teacher for 500 optimizer steps on 8×A100 (80G) GPUs with FSDP (§4). Axes carry the two “compute” costs (wall-clock ↓, AIME24 Avg@16 ↑); bubble area encodes the third cost (actor-update Δ peak GPU memory ↓). OPRD (navy bubble) simultaneously dominat…
Figure 2: Architecture of OPRD vs. output-space OPD. Both methods share the same on-policy rollout ŷ ∼ π_θ(· | x), which is fed to the student (blue, trainable) and the teacher (orange, frozen). OPD extracts supervision after the LM head, comparing output distributions p_t and q_t via reverse KL on a token subset. OPRD (ours) extracts supervision before the LM head, comparing intermediate hidden states h^(l) at select…
Figure 3: Overview of OPRD-Bridge. Top (Stage 1): both models are frozen; P_T is obtained via PCA of teacher hidden states and P_S is trained to align with P_T in ℝ^r. After convergence, both projectors are frozen. Bottom (Stage 2): the bridge (P_T, P_S) serves as a fixed interface; only the student backbone θ is updated via on-policy rollouts through the frozen bridge.
Figure 4: Training dynamics of OPRD vs. OPD baselines on AIME24 (left), AIME25 (middle), and AIMO (right). Top row: OPRD vs. OPD top-1 (sampled-token reverse KL); bottom row: OPRD vs. OPD top-16. Translucent line: raw Avg@16 at each evaluation step; solid line with markers: 5-step centered rolling mean. Within each panel the two methods share the same student initialization, on-policy rollouts, teacher forward passe…
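The two quantities plotted in Figure 4 are straightforward to reproduce from logged evaluations. The sketch below assumes Avg@16 means the fraction of 16 sampled answers that are correct for a problem, and that the rolling window shrinks at the curve's edges; the caption confirms neither, and the scores are hypothetical.

```python
def avg_at_k(correct_flags):
    """Avg@k: fraction of the k sampled answers to one problem that are correct."""
    return sum(correct_flags) / len(correct_flags)

def centered_rolling_mean(values, window=5):
    """Centered rolling mean; the window shrinks at the edges instead of padding."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical raw scores at seven consecutive evaluation steps.
raw = [40.0, 44.0, 42.0, 47.0, 45.0, 49.0, 48.0]
smoothed = centered_rolling_mean(raw, window=5)

for step, (r, s) in enumerate(zip(raw, smoothed)):
    print(step, r, round(s, 2))
```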
Figure 5: OPRD produces shorter responses than OPD at higher accuracy. Mean rollout length response_length/mean along training for OPRD vs. OPD top-1 vs. OPD top-16 (smoothing window = 15). OPRD converges to ∼5,700 tokens per response, while both OPD variants plateau around ∼7,000 tokens, indicating that hidden-state supervision yields more concise and efficient reasoning chains.
Figure 6: OPRD monotonically increases the student–teacher representation cosine similarity it supervises (higher is better). rep/cosine_similarity on the OPRD-supervised positions along training (smoothing window = 5). The curve rises sharply early and drifts upward steadily thereafter, confirming that (6) is being optimised end-to-end.
Figure 7: Adding OPRD on top of OPD top-1 monotonically lifts accuracy. AIME24 avg@16 of L_OPD + µ · L_OPRD for µ ∈ {0, 1, 10}. Even µ = 1 already surpasses OPD top-16 (47.1); µ = 10 closes the gap to teacher to within 0.6 pt.
Figure 8: The student diverges from the teacher mostly at the end of the response. Cosine similarity between student (R1-distill-1.5B) and teacher (JustRL-1.5B) last-layer hidden states on on-policy rollouts, restricted to either the first k or the last k response tokens, as a function of k (log scale; “ALL” = full response, at which both curves coincide at 95.42% by construction). The first-k curve is nearly teache…
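The first-k versus last-k diagnostic in Figure 8 amounts to averaging a per-token similarity over two different slices of the response. The sketch below uses a mean of per-token cosines, which is an assumed aggregation, and toy two-dimensional states constructed so that the student drifts from the teacher only in the later tokens.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_cosine(student_states, teacher_states):
    """Average per-token cosine similarity over aligned positions."""
    sims = [cosine(s, t) for s, t in zip(student_states, teacher_states)]
    return sum(sims) / len(sims)

def first_and_last_k(student_states, teacher_states, k):
    """Similarity restricted to the first k and to the last k response tokens."""
    first = mean_cosine(student_states[:k], teacher_states[:k])
    last = mean_cosine(student_states[-k:], teacher_states[-k:])
    return first, last

# Hypothetical last-layer states for a 4-token response.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

for k in (2, 4):
    first, last = first_and_last_k(student, teacher, k)
    print(k, round(first, 4), round(last, 4))
```

When k equals the response length, both slices cover every token, which is why the two curves must meet at the "ALL" point.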
Figure 9: OPRD accelerates the PG-loss phase transition and validates the information bottleneck. actor/pg_loss along training for OPD top-1 + OPRD composite runs (L_OPD top-1 + µ · L_OPRD, µ ∈ {0, 1, 10}; smoothing window = 15). All runs show a loss spike (possible phase transition); OPRD shifts it earlier, indicating accelerated distillation. In late training all curves converge to ≈ 0, yet accuracy differences per…
Figure 10: Adding OPRD to OPD top-16 further aligns student and teacher next-token top-16 sets (higher is better). Validation val-topk/overlap_ratio along training. The two runs are nearly co-located early on, but in the second half of training OPD top-16 plateaus while OPD top-16 + OPRD keeps climbing, the same late-stage divergence that distinguishes the accuracy curves of …
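One plausible reading of the overlap ratio in Figure 10 is the share of the teacher's top-k next-token candidates that also appear in the student's top-k set at the same position. The sketch below implements that reading. The definition, the function names, and the toy distributions are assumptions, and k is set to 2 rather than 16 to keep the example small.

```python
def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens (ties broken by lower index)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def overlap_ratio(student_probs, teacher_probs, k):
    """Fraction of the top-k token sets shared by student and teacher."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    return len(shared) / k

# Hypothetical next-token distributions over a 4-token vocabulary.
student_probs = [0.5, 0.3, 0.1, 0.1]
teacher_probs = [0.4, 0.1, 0.4, 0.1]

ratio = overlap_ratio(student_probs, teacher_probs, k=2)
print(f"top-2 overlap ratio: {ratio}")
```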
Figure 11: OPRD accelerates entropy alignment between student and teacher. Per-token entropy of π_θ (actor/entropy) and π_T (teacher/entropy) on rollout positions along training for OPD top-1 + OPRD composite runs (µ ∈ {0, 1, 10}, left → right). All runs exhibit an early entropy-increase phase during which the student–teacher gap widens; adding OPRD shifts this phase earlier (coinciding with the PG-loss spike of …
Figure 12: Cosine similarity between student and teacher representations in the shared ℝ^r subspace, before and after Stage 1 bridge training, as a function of rank r. Before training (gray dashed), the two models’ projected representations are essentially orthogonal at all ranks. After training (blue solid), similarity peaks at r = 8 (95.0%) and then decreases monotonically as r increases, reaching 77.0% at r = 2048 …
Original abstract

On-policy distillation (OPD) supervises the student exclusively in the output space by matching next-token distributions. This paradigm suffers from two limitations: (i) a high-variance gradient estimator whose signal-to-noise ratio collapses as the student approaches the teacher, and (ii) an LM-head information bottleneck that discards the teacher's intermediate hidden states. We propose On-Policy Representation Distillation (OPRD), the first method to lift on-policy distillation into the hidden-state space. OPRD aligns student and teacher representations across selected layers on the same on-policy rollouts, providing dense, deterministic, per-layer supervision while bypassing the LM head entirely. Theoretically, OPRD provides a deterministic per-sample gradient, removing the token-level estimation variance that plagues OPD, and exposes structural information that any output-space objective necessarily discards. Empirically, OPRD closes the student-teacher gap on competition mathematics benchmarks (AIME 2024, AIME 2025, and AIMO), where every output-space baseline plateaus below the teacher, while training 1.44x faster and using up to 54% less memory. We further extend OPRD to the cross-architecture setting via OPRD-Bridge. By exploiting the observation that heterogeneous models share a low-rank representational structure, we construct a frozen projector pair that aligns representations across arbitrary depth and width mismatches, shifting the alignment from the output space (which depends on a shared vocabulary) to the representation space. We validate OPRD-Bridge on both cross-architecture (Qwen3-4B -> Qwen3-1.7B-Base) and cross-tokenizer (Phi-4-mini-reasoning -> Qwen3-1.7B-Base) settings, demonstrating successful knowledge transfer even when the vocabulary-based alignment channel is unavailable. Code: https://github.com/ShenzhiYang2000/OPRD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces On-Policy Representation Distillation (OPRD), which extends on-policy distillation from output-space next-token matching to hidden-state alignment on shared on-policy rollouts. It claims this yields deterministic per-sample gradients (removing token-level variance), bypasses the LM-head bottleneck, closes the student-teacher performance gap on AIME 2024/2025 and AIMO benchmarks where output-space baselines plateau, trains 1.44x faster, and uses up to 54% less memory. It further proposes OPRD-Bridge, which uses a frozen projector pair to enable cross-architecture and cross-tokenizer transfer by exploiting an observed low-rank representational structure shared across heterogeneous models.

Significance. If the empirical results and the low-rank bridge construction hold under scrutiny, the work would offer a concrete advance in LLM distillation by moving supervision into representation space, enabling transfer where vocabulary mismatch blocks output-space methods. The deterministic gradient property is a clear theoretical improvement over standard OPD. The cross-tokenizer result, if substantiated, would be particularly notable for heterogeneous model families.

major comments (2)
  1. [Abstract] The central claim for OPRD-Bridge rests on the 'observation that heterogeneous models share a low-rank representational structure' permitting a frozen projector pair to align states across arbitrary depth/width/tokenizer mismatches. No rank measurements, covariance analysis, projector construction details, or ablation on on-policy rollouts are referenced to support that the structure is low-rank, stable, or task-relevant; without this, the claim that representation-space alignment succeeds where output-space fails cannot be evaluated.
  2. [Abstract] The empirical claims (closing the gap on AIME 2024/2025/AIMO, 1.44x speedup, 54% memory reduction, successful cross-architecture transfer) are stated without any reference to experimental protocol, baselines, variance across seeds, ablation controls, or statistical significance. These numbers are load-bearing for the practical contribution and cannot be assessed from the given description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below. The full manuscript contains the requested analyses and experimental details; we will revise the abstract to include explicit references to the relevant sections, figures, and tables so that the claims can be evaluated directly from the abstract.

  1. Referee: [Abstract] The central claim for OPRD-Bridge rests on the 'observation that heterogeneous models share a low-rank representational structure' permitting a frozen projector pair to align states across arbitrary depth/width/tokenizer mismatches. No rank measurements, covariance analysis, projector construction details, or ablation on on-policy rollouts are referenced to support that the structure is low-rank, stable, or task-relevant; without this, the claim that representation-space alignment succeeds where output-space fails cannot be evaluated.

    Authors: We agree that the abstract would be clearer with pointers to the supporting evidence. Section 3.3 and Figure 2 present singular-value spectra on on-policy rollouts demonstrating rapid decay (low effective rank), covariance matrices across model pairs, the SVD-based projector construction, and stability across depths. Section 5.3 contains the on-policy ablation. We will add citations to these in the revised abstract. revision: yes

  2. Referee: [Abstract] The empirical claims (closing the gap on AIME 2024/2025/AIMO, 1.44x speedup, 54% memory reduction, successful cross-architecture transfer) are stated without any reference to experimental protocol, baselines, variance across seeds, ablation controls, or statistical significance. These numbers are load-bearing for the practical contribution and cannot be assessed from the given description.

    Authors: The experimental protocol, baselines (standard OPD, output-space KD, and variants), per-seed variance (reported as mean ± std over three seeds), ablation controls, and significance testing appear in Sections 5 and 6 with Tables 1–4 and Figure 4. We will insert references to these sections and tables in the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is self-contained


The paper introduces OPRD as a new objective that aligns hidden states on on-policy rollouts and extends it via OPRD-Bridge, which builds its projectors on an empirical low-rank observation. No equations, fitted parameters, or self-citations are shown to reduce the central claims to prior inputs by construction. The method is presented as a direct proposal that bypasses output-space bottlenecks, with empirical validation on benchmarks serving as an external check rather than an internal redefinition. This qualifies as a normal non-circular case under the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on domain assumptions about representation alignment and low-rank structure across models; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Hidden representations at selected layers of student and teacher models are alignable on identical on-policy rollouts.
    This is the core mechanism enabling per-layer supervision without the LM head.
  • domain assumption: Heterogeneous models share a low-rank representational structure that permits alignment via a frozen projector pair despite depth and width differences.
    Required for the OPRD-Bridge extension to cross-architecture and cross-tokenizer settings.

pith-pipeline@v0.9.1-grok · 5915 in / 1319 out tokens · 57143 ms · 2026-06-28T02:42:04.601383+00:00 · methodology

