SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

Bichuan Feng; Jie Sun; Junfeng Fang; Mao Zheng; Mingyang Song; Pengfei Liu; Qiyong Zhong; Xiang Wang; Yilin Cheng


SimCT restores lost teacher signals in on-policy distillation by matching predictions over short multi-token sequences that both tokenizers can produce.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-22 10:26 UTC pith:AZEKMGBI

Load-bearing objection: SimCT gives a practical patch for on-policy distillation across mismatched tokenizers by adding short multi-token comparisons, with reported gains on math and code tasks, though the on-policy claim after re-tokenization needs verification.

arxiv 2605.07711 v2 pith:AZEKMGBI submitted 2026-05-08 cs.CL


classification cs.CL

keywords on-policy distillation · cross-tokenizer · knowledge distillation · language models · tokenization · mathematical reasoning · code generation

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation assumes teacher and student outputs can be compared token by token, but this assumption collapses when the models use different tokenizers and silently drops large parts of the teacher signal at points of vocabulary disagreement. SimCT enlarges the supervision space to include short multi-token continuations that both tokenizers can realize, keeping the original loss form intact. The paper argues these units form the finest jointly usable interface and that coarser groupings erase distinctions helpful for on-policy learning. A reader would care because the method enables better knowledge transfer to smaller models in settings where tokenizer mismatch is routine, such as mathematical reasoning and code generation.
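To make the "silently drops" point concrete, here is a minimal sketch of how exact shared-token matching can discard teacher probability mass. The function name `split_teacher_mass`, the toy vocabulary, and the probabilities are all hypothetical illustrations, not taken from the paper or its code.

```python
def split_teacher_mass(teacher_probs, student_vocab):
    """Return (kept, dropped) teacher probability under exact shared-token matching."""
    # Only tokens present in both vocabularies can be compared one-to-one.
    shared = teacher_probs.keys() & student_vocab
    kept = sum(teacher_probs[tok] for tok in shared)
    return kept, 1.0 - kept


# Hypothetical teacher next-token distribution and student vocabulary.
teacher_probs = {"dist": 0.6, "ation": 0.3, " the": 0.1}
student_vocab = {"distill", "ation", " the"}

kept, dropped = split_teacher_mass(teacher_probs, student_vocab)
print(f"kept={kept:.2f} dropped={dropped:.2f}")
```

In this toy case the teacher's most likely token, `"dist"`, has no counterpart in the student vocabulary, so most of the teacher's mass at this position is unavailable to a shared-token loss.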

Core claim

The central claim has three parts: comparing teacher and student over short multi-token continuations that both tokenizers can realize recovers the supervision discarded by exact shared-token matching; these units are the finest jointly tokenizable interface; and coarser alternatives remove distinctions useful for on-policy learning. All of this leaves the OPD loss form unchanged.

What carries the argument

Short multi-token continuations realizable by both tokenizers, used as additional supervision units alongside shared tokens in the unchanged OPD loss.
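One way to read "continuations realizable by both tokenizers" is as the text spans between token boundaries that the two tokenizations share. The sketch below follows that reading. It is an assumption about the construction, not the paper's algorithm, and `minimal_aligned_units` and the toy token lists are hypothetical.

```python
from itertools import accumulate


def boundaries(tokens):
    """Character offsets at which a token ends (cumulative token lengths)."""
    return set(accumulate(len(t) for t in tokens))


def minimal_aligned_units(teacher_tokens, student_tokens):
    """Split the common text at every boundary that both tokenizations share."""
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    shared = sorted(boundaries(teacher_tokens) & boundaries(student_tokens))
    units, start = [], 0
    for end in shared:
        units.append(text[start:end])
        start = end
    return units


# Hypothetical tokenizations of the same string.
teacher = ["dist", "ill", "ation", " works"]
student = ["distill", "ation", " wor", "ks"]
units = minimal_aligned_units(teacher, student)
print(units)
```

Here `"ation"` is an ordinary shared token, while `"distill"` and `" works"` are units that one tokenizer emits as a single token and the other as two. No finer split is consistent with both sets of boundaries, which is the sense of "finest" assumed in this sketch.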

Load-bearing premise

Short multi-token continuations preserve teacher-student distinctions that matter for on-policy learning, and coarser matching would erase those distinctions.

What would settle it

The claim would be undercut if student models trained with SimCT showed no measurable improvement over shared-token OPD on the mathematical reasoning and code-generation benchmarks, or if coarser units performed equally well in ablations.


If this is right

Consistent gains appear over shared-vocabulary OPD and other cross-tokenizer baselines on three heterogeneous teacher-student pairs.
Ablations confirm the gains arise specifically from recovering supervision lost to exact shared-token matching.
The method works without any change to the underlying OPD loss function.
Coarser supervision units are shown to discard distinctions that remain useful for on-policy learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same supervision-recovery idea could be tested in other distillation settings that also rely on token-level alignment.
Adopting multi-token units might reduce the preprocessing cost of forcing tokenizer compatibility before distillation.
If the approach scales, it would allow direct distillation between models whose vocabularies share almost no tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper claims that on-policy distillation (OPD) between models with heterogeneous tokenizers loses substantial teacher signal through exact shared-token matching. SimCT enlarges the supervision space by additionally comparing teacher and student over short multi-token continuations that both tokenizers can realize, without changing the OPD loss form. The authors argue these units constitute the finest jointly tokenizable interface and that coarser alternatives discard useful distinctions. Experiments on three heterogeneous teacher-student pairs for mathematical reasoning and code-generation benchmarks report consistent gains over shared-vocabulary OPD and cross-tokenizer baselines, with ablations attributing the improvements to recovered supervision. Code is released.

Significance. If the central claim holds, SimCT provides a lightweight, loss-preserving way to recover supervision lost to tokenizer mismatch, a practical issue in distillation pipelines. The explicit ablations and public code strengthen verifiability. The result would be most impactful if the multi-token units demonstrably preserve on-policy properties while adding signal; otherwise the gains may reflect a different mechanism than claimed.

major comments (2)

[§3] §3 (method description of continuation realization): the procedure of decoding student-generated tokens to text and re-tokenizing segments under the teacher's vocabulary means the compared sequences are not sampled directly under the student's native policy. This introduces a potential distribution shift that must be shown not to violate the strict on-policy premise of OPD; without such justification the claim that SimCT 'leaves the OPD loss form itself unchanged' while remaining on-policy is not yet established.
[Experiments] Experiments section (results tables): the reported gains are described as 'consistent' but the manuscript provides no error bars, dataset sizes, number of runs, or statistical significance tests. Because the central claim rests on these empirical improvements over shared-token OPD, the absence of these details leaves the magnitude and reliability of the recovered supervision signal difficult to evaluate.
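The first major comment turns on the decode and re-tokenize step, so a small sketch of that step may help readers follow the dispute. It assumes a toy longest-match tokenizer (`greedy_tokenize`) and a made-up teacher vocabulary. Real subword tokenizers behave differently, and nothing here is taken from the paper's code.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer; a stand-in for a real subword tokenizer."""
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocab),
            None,
        )
        if match is None:
            raise ValueError(f"no vocabulary entry covers position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical teacher vocabulary and student-sampled token sequence.
TEACHER_VOCAB = {"dist", "ill", "ation", " works"}
student_sample = ["distill", "ation", " wor", "ks"]

text = "".join(student_sample)                      # decode the student's sample
teacher_view = greedy_tokenize(text, TEACHER_VOCAB)  # teacher's tokenization of it
print(teacher_view)
```

The student's sampled sequence is left untouched; only the teacher's view of the same text changes. Whether that is enough to preserve the on-policy premise is exactly what the referee asks the authors to justify, and this sketch does not settle it.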

minor comments (2)

[Abstract] Abstract: quantitative details (exact deltas, dataset sizes, number of teacher-student pairs) are omitted, making it hard for readers to gauge the scale of the reported gains before reaching the results section.
[§3] Notation: the definition of 'short multi-token continuations' should be formalized (e.g., maximum length in tokens or characters) to allow precise reproduction.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify key aspects of our work. We provide point-by-point responses below and indicate planned revisions to strengthen the manuscript.


Referee: [§3] §3 (method description of continuation realization): the procedure of decoding student-generated tokens to text and re-tokenizing segments under the teacher's vocabulary means the compared sequences are not sampled directly under the student's native policy. This introduces a potential distribution shift that must be shown not to violate the strict on-policy premise of OPD; without such justification the claim that SimCT 'leaves the OPD loss form itself unchanged' while remaining on-policy is not yet established.

Authors: We appreciate this observation on the on-policy property. In SimCT the student samples tokens directly from its native policy to generate a text continuation; this text is the common interface that is then re-tokenized under the teacher vocabulary solely to obtain the teacher's probability distribution over the equivalent sequence. The OPD loss remains unchanged in form because it is still applied to the student's token-level probabilities for its own generated sequence, now aligned against the teacher's view of the same underlying text. No off-policy sampling occurs for the student. We will revise §3 to include an explicit paragraph justifying that the procedure preserves the strict on-policy premise of OPD while recovering additional supervision at the text level.
Revision: partial

Referee: [Experiments] Experiments section (results tables): the reported gains are described as 'consistent' but the manuscript provides no error bars, dataset sizes, number of runs, or statistical significance tests. Because the central claim rests on these empirical improvements over shared-token OPD, the absence of these details leaves the magnitude and reliability of the recovered supervision signal difficult to evaluate.

Authors: We agree that reporting error bars, run counts, and significance tests is essential for evaluating the reliability of the gains. In the revised manuscript we will add error bars computed across multiple independent runs, explicitly state the exact sizes of the training and evaluation splits for each benchmark, and include statistical significance tests (paired t-tests) comparing SimCT against the shared-token OPD baseline. These additions will directly address concerns about the magnitude and consistency of the recovered supervision signal.
Revision: yes
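For readers unfamiliar with the paired t-test promised in the second response, a minimal standard-library sketch follows. The helper `paired_t_statistic` and the placeholder scores are hypothetical and are not results from the paper; the sketch returns only the t statistic and degrees of freedom, and a p-value would need a t-distribution lookup (for example via `scipy.stats.ttest_rel`).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. one score per seed."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder numbers only; these are NOT results from the paper.
placeholder_method = [3.0, 4.0, 5.0, 6.0]
placeholder_baseline = [2.0, 2.0, 3.0, 3.0]
t_stat, dof = paired_t_statistic(placeholder_method, placeholder_baseline)
print(f"t={t_stat:.3f}, dof={dof}")
```

Pairing matters here because each run of the method and the baseline would share a seed or data split, so the test is applied to per-pair differences rather than to two independent samples.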

Circularity Check

0 steps flagged

No circularity: method extends OPD loss directly with external benchmarks


The paper defines SimCT as an enlargement of the supervision space via short multi-token continuations while explicitly leaving the OPD loss form unchanged, with gains demonstrated through ablations and comparisons on mathematical reasoning and code-generation benchmarks. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central premise relies on the external property that coarser alternatives discard useful distinctions, evaluated against independent baselines rather than tautological inputs. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract supplies no explicit free parameters or invented entities; the approach rests on a domain assumption about the granularity of jointly realizable sequences.

axioms (1)

Domain assumption: Short multi-token continuations constitute the finest jointly tokenizable supervision interface between heterogeneous tokenizers.
This premise directly justifies why the method recovers useful signal without altering the loss form.

pith-pipeline@v0.9.0 · 5785 in / 1224 out tokens · 43608 ms · 2026-05-22T10:26:49.520106+00:00 · methodology


Original abstract

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-.

Figures

Figures reproduced from arXiv: 2605.07711 by Bichuan Feng, Jie Sun, Junfeng Fang, Mao Zheng, Mingyang Song, Pengfei Liu, Qiyong Zhong, Xiang Wang, Yilin Cheng.

**Figure 1.** Motivation and performance of SimCT. (A) LLM tokenizers often share only partial vocabularies across model families, making shared-token distillation restrictive. (B) The same text can induce different token boundaries and prediction spaces, so standard token-level OPD is ill-defined. (C) SimCT constructs a common aligned supervision space and achieves the best average Pass@1 over SFT and prior cross-token…

**Figure 2.** Framework of SimCT. (A) In cross-tokenizer OPD, teacher and student next-token distributions are conditioned on the same student-generated prefix, but are defined over different tokenizer spaces. We make them comparable through a common supervision space $U$. (B) SimCT builds $U_{\mathrm{SimCT}} = (V_T \cap V_S) \cup A$ by adding minimal aligned units $A$, then applies the original OPD loss on the induced supervision distributions.…

**Figure 3.** Supervision recovery under tokenizer mismatch. (A) Many teacher and student tokens are not aligned one-to-one, revealing supervision loss from sequence mismatch. (B) Even at aligned positions, high-probability teacher predictions may fall outside the shared vocabulary. (C) Recovering either missing source improves over Base, while Full SimCT performs best by recovering both in a common aligned-unit space. …

**Figure 4.** Ablation on recovered unit supervision and aligned-unit coarsening. (Left) Recovering more mismatch-unit supervision consistently improves both students. (Right) Coarsening minimal aligned units removes within-span KL signal and reduces downstream gains, supporting the need for SimCT’s minimal aligned units. Thus, ∆C measures the within-unit KL signal removed when minimal aligned units are merged. …
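The coarsening claim in the Figure 4 caption rests on a standard property of KL divergence: merging outcomes can never increase it (the log-sum inequality). The sketch below checks this numerically on toy distributions. The distributions, the grouping, and the names `kl`, `coarsen`, and `delta_c` are hypothetical, and the sketch assumes ∆C is the difference between the fine and coarse KL values.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for dicts over the same support; p = 0 terms contribute 0."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def coarsen(dist, groups):
    """Merge units: each group receives the summed mass of its members."""
    return {name: sum(dist[u] for u in members) for name, members in groups.items()}


# Hypothetical student and teacher distributions over three minimal units.
q_student = {"a": 0.5, "b": 0.3, "c": 0.2}
q_teacher = {"a": 0.2, "b": 0.6, "c": 0.2}
groups = {"ab": ["a", "b"], "c": ["c"]}  # merge units a and b into one

kl_min = kl(q_student, q_teacher)
kl_coarse = kl(coarsen(q_student, groups), coarsen(q_teacher, groups))
delta_c = kl_min - kl_coarse  # signal removed by merging, under the stated assumption
print(f"KL fine={kl_min:.4f}, KL coarse={kl_coarse:.4f}, delta={delta_c:.4f}")
```

In this toy case the two models disagree only about how mass is split between `a` and `b`, so merging them leaves essentially no divergence to learn from, while the fine-grained KL is about 0.25 nats.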


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_injective · unclear
Paper passage: "minimal aligned units... finest boundary-consistent supervision interface jointly expressible by the teacher and student tokenizers"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel and Jcost_pos_of_ne_one · unclear
Paper passage: "$\mathrm{KL}(q^{\min}_S \,\|\, q^{\min}_T) \ge \mathrm{KL}(q^{C}_S \,\|\, q^{C}_T)$ ... within-unit teacher–student discrepancy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cross-Tokenizer On-Policy Distillation via Byte-Prefix Marginalization
cs.LG 2026-07 conditional novelty 7.0

Byte-Prefix Marginalization maps a teacher's next-token distribution onto the student's vocabulary through shared byte prefixes plus an explicit residual, giving a mass-preserving target for on-policy distillation acr...
Group-Reflective Self-Distillation for Agentic Reinforcement Learning
cs.AI 2026-07 conditional novelty 6.0

Contrasting a policy’s own success and failure reflections yields turn-level credit that beats GRPO and skill-based self-distillation on agentic tasks.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 2 Pith papers · 23 internal anchors

[1]

Distilling the Knowledge in a Neural Network

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.arXiv preprint, arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[2]

A survey on model compression for large language models.Transactions of the Association for Computational Linguistics (TACL), 2024

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models.Transactions of the Association for Computational Linguistics (TACL), 2024

work page 2024
[3]

A Survey on Knowledge Distillation of Large Language Models

Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models.arXiv preprint, arXiv:2402.13116, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Survey on knowledge distillation for large language models: Methods, evaluation, and application.arXiv preprint, arXiv:2407.01885, 2024

Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chena, Chenlong Gao, Bingjie Yan, and Yiqiang Chen. Survey on knowledge distillation for large language models: Methods, evaluation, and application.arXiv preprint, arXiv:2407.01885, 2024

work page arXiv 2024
[5]

DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. InProceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (NeurIPS 2019), Vancouver, Canada, 2019

work page 2019
[6]

TinyBERT: Distilling BERT for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online Event, 2020

work page 2020
[7]

Compact language models via pruning and knowledge distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024

work page 2024
[8]

Scheduled sampling for sequence prediction with recurrent neural networks

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InAdvances in Neural Information Processing Systems 28 (NeurIPS 2015), 2015

work page 2015
[9]

Bridging the gap between training and inference for neural machine translation

Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 4334–4343, Florence, Italy, 2019

work page 2019
[10]

Autoregressive knowledge distillation through imitation learning

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 6121–6133, Online Event, 2020

work page 2020
[11]

SequenceMatch: Imitation learning for autoregressive sequence modelling with backtracking

Chris Cundy and Stefano Ermon. SequenceMatch: Imitation learning for autoregressive sequence modelling with backtracking. InProceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024

work page 2024
[12]

Gordon, and J

Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), pages 627–635, Fort Lauderdale, FL, USA, 2011

work page 2011
[13]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InProceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024

work page 2024
[14]

MiniLLM: On-policy distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: On-policy distillation of large language models. InProceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024. 10

work page 2024
[15]

DistiLLM: Towards streamlined distillation for large language models

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DistiLLM: Towards streamlined distillation for large language models. InProceedings of the 41st International Conference on Machine Learning (ICML 2024), volume 235, Vienna, Austria, 2024

work page 2024
[16]

DistiLLM-2: A contrastive approach boosts the distillation of LLMs

Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025

work page 2025
[17]

Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643,

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models.arXiv preprint, arXiv:2511.10643, 2025

work page arXiv 2025
[18]

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Yecheng Wu, Song Han, and Hai Cai. Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation.arXiv preprint, arXiv:2604.13010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Stable On-Policy Distillation through Adaptive Target Reformulation

Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation.arXiv preprint, arXiv:2601.07155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy dis- tillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint, arXiv:2604.13016, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint, arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint, arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Towards cross- tokenizer distillation: The universal logit distillation loss for LLMs.Transactions on Machine Learning Research (TMLR), 2025

Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross- tokenizer distillation: The universal logit distillation loss for LLMs.Transactions on Machine Learning Research (TMLR), 2025

work page 2025
[24]

Dual-space knowledge distillation for large language models

Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu. Dual-space knowledge distillation for large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), pages 18164–18181, Miami, Florida, USA, 2024

work page 2024
[25]

CTPD: Cross-tokenizer preference distillation

Truong Nguyen, Phi Van Dat, Ngan Nguyen, Linh Ngo Van, Trung Le, and Thanh Hong Nguyen. CTPD: Cross-tokenizer preference distillation. InProceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), Philadelphia, PA, USA, 2026

work page 2026
[26]

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, and Davide Buffelli. Cross-tokenizer LLM distillation through a byte-level interface.arXiv preprint, arXiv:2604.07466, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Cross-tokenizer likelihood scoring algorithms for language model distillation

Buu Phan, Ashish Khisti, and Karen Ullrich. Cross-tokenizer likelihood scoring algorithms for language model distillation. InProceedings of the 14th International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026

work page 2026
[28]

Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models

Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, and Houqiang Li. Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025), Philadelphia, PA, 2025

work page 2025
[29]

Unlocking on-policy distillation for any model family.Hugging Face Tech- nical Report, 2025

César Miguel Patiño, Kashif Rasul, Quentin Gallouédec, Ben Burtenshaw, Sergio Paniego, Vaishakh Srivastav, Thomas Frere, Edward Beeching, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Unlocking on-policy distillation for any model family.Hugging Face Tech- nical Report, 2025. https://huggingfaceh4-on-policy-distillation.hf. space/unlocking-on-policy...

work page 2025
[30]

arXiv preprint arXiv:2504.11426 , year =

Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. A dual-space framework for general knowledge distillation of large language models. arXiv preprint, arXiv:2504.11426, 2025. 11

work page arXiv 2025
[31]

Universal cross-tokenizer distil- lation via approximate likelihood matching

Benjamin Minixhofer, Ivan Vuli´c, and Edoardo Maria Ponti. Universal cross-tokenizer distil- lation via approximate likelihood matching. InAdvances in Neural Information Processing Systems 38 (NeurIPS 2025), 2025

work page 2025
[32]

Enhancing cross-tokenizer knowledge distillation with contextual dynamical mapping

Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. Enhancing cross-tokenizer knowledge distillation with contextual dynamical mapping. InFindings of the Association for Computational Linguistics: ACL 2025, pages 8005–8018, Vienna, Austria, 2025

work page 2025
[33]

Entropy-Aware On-Policy Distillation of Language Models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint, arXiv:2603.07079, 2026

work page internal anchor Pith review arXiv 2026
[34]

On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/ on-policy-distillation

work page doi:10.64434/tml.20251026 2025
[35]

Rethinking kullback- leibler divergence in knowledge distillation for large language models

Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, and Ngai Wong. Rethinking kullback- leibler divergence in knowledge distillation for large language models. InProceedings of the 31st International Conference on Computational Linguistics (COLING 2025), 2025

work page 2025
[36]

arXiv preprint arXiv:2604.00223 , year=

Hoang-Chau Luong, Dat Ba Tran, and Lingwei Chen. Diversity-aware reverse kullback-leibler divergence for large language model distillation.arXiv preprint, arXiv:2604.00223, 2026

work page arXiv 2026
[37]

Distillation of Large Language Models via Concrete Score Matching

Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, and Il-Chul Moon. Distillation of large language models via concrete score matching.arXiv preprint, arXiv:2509.25837, 2025

work page internal anchor Pith review arXiv 2025
[38]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on- policy distillation: Empirical failure modes and simple fixes.arXiv preprint, arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

SelecTKD: Selective token-weighted knowledge distillation for LLMs, 2025

Haiduo Huang, Jiangcheng Song, Yadong Zhang, and Pengju Ren. SelecTKD: Selective token-weighted knowledge distillation for LLMs.arXiv preprint, arXiv:2510.24021, 2025

work page arXiv 2025
[40]

TIP: Token Importance in On-Policy Distillation

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. TIP: Token importance in on-policy distillation.arXiv preprint, arXiv:2604.14084, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

CoT2Align: Cross-chain of thought distillation via optimal transport alignment for language models with different tokenizers.arXiv preprint, arXiv:2502.16806, 2025

Anh Duc Le, Tu Vu, Nam Le Hai, Nguyen Thi Ngoc Diep, Linh Ngo Van, Trung Le, and Thien Huu Nguyen. CoT2Align: Cross-chain of thought distillation via optimal transport alignment for language models with different tokenizers.arXiv preprint, arXiv:2502.16806, 2025

work page arXiv 2025
[42]

Dual-space knowledge distillation with key-query matching for large language models with vocabulary mismatch

Stella Eva Tsiapali, Cong-Thanh Do, and Kate Knill. Dual-space knowledge distillation with key-query matching for large language models with vocabulary mismatch. InProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026), 2026

work page 2026
[43]

Overcoming vocabulary mismatch: V ocabulary- agnostic teacher guided language modeling.arXiv preprint, arXiv:2503.19123, 2025

Haebin Shin, Lei Ji, Xiao Liu, and Yeyun Gong. Overcoming vocabulary mismatch: V ocabulary- agnostic teacher guided language modeling.arXiv preprint, arXiv:2503.19123, 2025

work page arXiv 2025
[44]

MoL: Mixture of layers in cross- tokenizer embedding model distillation.Knowledge-Based Systems, 2026

Hai An Vu, Minh-Phuc Truong, Tu Vu, and Linh Ngo Van. MoL: Mixture of layers in cross- tokenizer embedding model distillation.Knowledge-Based Systems, 2026

work page 2026
[45]

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V . Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason ...

work page internal anchor Pith review Pith/arXiv arXiv 2016
[46]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Jun-Kun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint, arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint, arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

Orca-Math: Unlocking the potential of SLMs in grade school math

Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-Math: Unlocking the potential of SLMs in grade school math. arXiv preprint, arXiv:2402.14830, 2024

work page arXiv 2024
[51]

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. OpenMathInstruct-1: A 1.8 million math instruction tuning dataset. arXiv preprint, arXiv:2402.10176, 2024

work page Pith review arXiv 2024
[52]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021

work page 2021
[53]

OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs. arXiv preprint, arXiv:2504.04030, 2025

work page arXiv 2025
[54]

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. arXiv preprint, arXiv:2503.02951, 2025

work page Pith review arXiv 2025
[55]

TACO: Topics in algorithmic COde generation dataset

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. TACO: Topics in algorithmic COde generation dataset. arXiv preprint, arXiv:2312.14852, 2023

work page arXiv 2023
[56]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page 2022
[57]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), 2024

work page 2024
[58]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint, arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[59]

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), 2025

work page 2025
[60]

KDFlow: A user-friendly and efficient knowledge distillation framework for large language models

Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, and Jinan Xu. KDFlow: A user-friendly and efficient knowledge distillation framework for large language models. arXiv preprint, arXiv:2603.01875, 2026

work page internal anchor Pith review arXiv 2026
[61]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[62]

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, and Pengfei Liu. AgencyBench: Benchmarking the frontiers of autonomous agents in 1M-token real-world contexts. arXiv preprint, arXiv:2601.11044, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[63]

Limi: Less is more for agency

Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, and Pengfei Liu. Limi: Less is more for agency. arXiv preprint, arXiv:2509.17567, 2025

work page arXiv 2025
[64]

Innovatorbench: Evaluating agents’ ability to conduct innovative LLM research

Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, and Pengfei Liu. Innovatorbench: Evaluating agents’ ability to conduct innovative LLM research. arXiv preprint, arXiv:2510.27598, 2025

work page arXiv 2025
[65]

davinci-dev: Agent-native mid-training for software engineering

Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, and Pengfei Liu. davinci-dev: Agent-native mid-training for software engineering. arXiv preprint, arXiv:2601.18418, 2026

work page arXiv 2026
[66]

Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces

Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, and Kaipeng Zhang. Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces. arXiv preprint,...

work page arXiv 2026
[67]

davinci-env: Open swe environment synthesis at scale

Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, and Pengfei Liu. davinci-env: Open swe environment synthesis at scale. arXiv preprint, arXiv:2603.13023, 2026

work page arXiv 2026
[68]

Argo: Asynchronous rollout with human guidance for research agent optimization

Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Xiangkun Hu, Yang Xiao, and Pengfei Liu. Argo: Asynchronous rollout with human guidance for research agent optimization. OpenReview preprint, 2026

work page 2026
[69]

SOD: Step-wise On-policy Distillation for Small Language Model Agents

Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, and Junfeng Fang. SOD: Step-wise on-policy distillation for small language model agents. arXiv preprint, arXiv:2605.07725, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[70]

Unifying group-relative and self-distillation policy optimization via sample routing

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint, arXiv:2604.02288, 2026

work page arXiv 2026
[71]

Robust preference optimization via dynamic target margins

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5399–5416, 2025

work page 2025
[72]

Lamp-val: Large language models empower personalized valuation in auction

Jie Sun, Tianyu Zhang, Houcheng Jiang, Kexin Huang, Xiang Shu, Zhibo Zhu, Lintao Ma, Xingyu Lu, Jun Zhou, Junkang Wu, Chi Luo, An Zhang, Jiancan Wu, and Xiang Wang. Lamp-val: Large language models empower personalized valuation in auction. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 579–595, 2025

work page 2025
[73]

SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, and Xiang Wang. SepSeq: A training-free framework for long numerical sequence processing in LLMs. arXiv preprint, arXiv:2604.07737, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[74]

A simple data augmentation for graph classification: A perspective of equivariance and invariance

Yongduo Sui, Shuyao Wang, Jie Sun, Zhiyuan Liu, Qing Cui, Longfei Li, Jun Zhou, Xiang Wang, and Xiangnan He. A simple data augmentation for graph classification: A perspective of equivariance and invariance. ACM Transactions on Knowledge Discovery from Data, 19(2): 1–24, 2025

work page 2025
[75]

A unified invariant learning framework for graph classification

Yongduo Sui, Jie Sun, Shuyao Wang, Zemin Liu, Qing Cui, Longfei Li, and Xiang Wang. A unified invariant learning framework for graph classification. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025.

A Training Data Construction and Curation

This appendix describes the construction of the cold-start SFT corpus...

work page 2025
[76]

Since 10000 = 7 × 1428 + 4, the remainder is 4

= 10000. Since 10000 = 7 × 1428 + 4, the remainder is 4. Answer: 4 ✓ Correct
SimpleOPD: Correctly identifies 100 terms and computes the sum as 10000. However, makes an arithmetic error in the final modular division step, computing 10000 ÷ 7 = 1428 remainder 3 instead of 4. Answer: 3 ✗ Incorrect
ALM: Incorrectly counts the number of terms as 199 (confusing the last ter...

work page

[1] [1]

Distilling the Knowledge in a Neural Network

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. arXiv preprint, arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[2] [2]

A survey on model compression for large language models

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics (TACL), 2024

work page 2024

[3] [3]

A Survey on Knowledge Distillation of Large Language Models

Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint, arXiv:2402.13116, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Survey on knowledge distillation for large language models: Methods, evaluation, and application

Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chena, Chenlong Gao, Bingjie Yan, and Yiqiang Chen. Survey on knowledge distillation for large language models: Methods, evaluation, and application. arXiv preprint, arXiv:2407.01885, 2024

work page arXiv 2024

[5] [5]

DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (NeurIPS 2019), Vancouver, Canada, 2019

work page 2019

[6] [6]

TinyBERT: Distilling BERT for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online Event, 2020

work page 2020

[7] [7]

Compact language models via pruning and knowledge distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024

work page 2024

[8] [8]

Scheduled sampling for sequence prediction with recurrent neural networks

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), 2015

work page 2015

[9] [9]

Bridging the gap between training and inference for neural machine translation

Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 4334–4343, Florence, Italy, 2019

work page 2019

[10] [10]

Autoregressive knowledge distillation through imitation learning

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 6121–6133, Online Event, 2020

work page 2020

[11] [11]

SequenceMatch: Imitation learning for autoregressive sequence modelling with backtracking

Chris Cundy and Stefano Ermon. SequenceMatch: Imitation learning for autoregressive sequence modelling with backtracking. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024

work page 2024

[12] [12]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), pages 627–635, Fort Lauderdale, FL, USA, 2011

work page 2011

[13] [13]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024

work page 2024

[14] [14]

MiniLLM: On-policy distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: On-policy distillation of large language models. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024

work page 2024

[15] [15]

DistiLLM: Towards streamlined distillation for large language models

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DistiLLM: Towards streamlined distillation for large language models. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), volume 235, Vienna, Austria, 2024

work page 2024

[16] [16]

DistiLLM-2: A contrastive approach boosts the distillation of LLMs

Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), 2025

work page 2025

[17] [17]

Black-box on-policy distillation of large language models

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models. arXiv preprint, arXiv:2511.10643, 2025

work page arXiv 2025

[18] [18]

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Yecheng Wu, Song Han, and Hai Cai. Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation. arXiv preprint, arXiv:2604.13010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Stable On-Policy Distillation through Adaptive Target Reformulation

Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint, arXiv:2601.07155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint, arXiv:2604.13016, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint, arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint, arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Towards cross-tokenizer distillation: The universal logit distillation loss for LLMs

Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross-tokenizer distillation: The universal logit distillation loss for LLMs. Transactions on Machine Learning Research (TMLR), 2025

work page 2025

[24] [24]

Dual-space knowledge distillation for large language models

Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu. Dual-space knowledge distillation for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), pages 18164–18181, Miami, Florida, USA, 2024

work page 2024

[25] [25]

CTPD: Cross-tokenizer preference distillation

Truong Nguyen, Phi Van Dat, Ngan Nguyen, Linh Ngo Van, Trung Le, and Thanh Hong Nguyen. CTPD: Cross-tokenizer preference distillation. In Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), Philadelphia, PA, USA, 2026

work page 2026

[26] [26]

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, and Davide Buffelli. Cross-tokenizer LLM distillation through a byte-level interface. arXiv preprint, arXiv:2604.07466, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

Cross-tokenizer likelihood scoring algorithms for language model distillation

Buu Phan, Ashish Khisti, and Karen Ullrich. Cross-tokenizer likelihood scoring algorithms for language model distillation. In Proceedings of the 14th International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026

work page 2026

[28] [28]

Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models

Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, and Houqiang Li. Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025), Philadelphia, PA, 2025

work page 2025

[29] [29]

Unlocking on-policy distillation for any model family

César Miguel Patiño, Kashif Rasul, Quentin Gallouédec, Ben Burtenshaw, Sergio Paniego, Vaishakh Srivastav, Thomas Frere, Edward Beeching, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Unlocking on-policy distillation for any model family. Hugging Face Technical Report, 2025. https://huggingfaceh4-on-policy-distillation.hf.space/unlocking-on-policy...

work page 2025

[30] [30]

A dual-space framework for general knowledge distillation of large language models

Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. A dual-space framework for general knowledge distillation of large language models. arXiv preprint, arXiv:2504.11426, 2025

work page arXiv 2025

[31] [31]

Universal cross-tokenizer distillation via approximate likelihood matching

Benjamin Minixhofer, Ivan Vulić, and Edoardo Maria Ponti. Universal cross-tokenizer distillation via approximate likelihood matching. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 2025

work page 2025

[32] [32]

Enhancing cross-tokenizer knowledge distillation with contextual dynamical mapping

Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. Enhancing cross-tokenizer knowledge distillation with contextual dynamical mapping. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8005–8018, Vienna, Austria, 2025

work page 2025

[33] [33]

Entropy-Aware On-Policy Distillation of Language Models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint, arXiv:2603.07079, 2026

work page internal anchor Pith review arXiv 2026

[34] [34]

On-policy distillation

Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025

[35] [35]

Rethinking Kullback-Leibler divergence in knowledge distillation for large language models

Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler divergence in knowledge distillation for large language models. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), 2025

work page 2025

[36] [36]

Diversity-aware reverse Kullback-Leibler divergence for large language model distillation

Hoang-Chau Luong, Dat Ba Tran, and Lingwei Chen. Diversity-aware reverse Kullback-Leibler divergence for large language model distillation. arXiv preprint, arXiv:2604.00223, 2026

work page arXiv 2026

[37] [37]

Distillation of Large Language Models via Concrete Score Matching

Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, and Il-Chul Moon. Distillation of large language models via concrete score matching. arXiv preprint, arXiv:2509.25837, 2025

work page internal anchor Pith review arXiv 2025

[38] [38]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint, arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

SelecTKD: Selective token-weighted knowledge distillation for LLMs

Haiduo Huang, Jiangcheng Song, Yadong Zhang, and Pengju Ren. SelecTKD: Selective token-weighted knowledge distillation for LLMs. arXiv preprint, arXiv:2510.24021, 2025

work page arXiv 2025

[40] [40]

TIP: Token Importance in On-Policy Distillation

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. TIP: Token importance in on-policy distillation.arXiv preprint, arXiv:2604.14084, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[41] [41]

CoT2Align: Cross-chain of thought distillation via optimal transport alignment for language models with different tokenizers.arXiv preprint, arXiv:2502.16806, 2025

Anh Duc Le, Tu Vu, Nam Le Hai, Nguyen Thi Ngoc Diep, Linh Ngo Van, Trung Le, and Thien Huu Nguyen. CoT2Align: Cross-chain of thought distillation via optimal transport alignment for language models with different tokenizers.arXiv preprint, arXiv:2502.16806, 2025

work page arXiv 2025

[42] [42]

Dual-space knowledge distillation with key-query matching for large language models with vocabulary mismatch

Stella Eva Tsiapali, Cong-Thanh Do, and Kate Knill. Dual-space knowledge distillation with key-query matching for large language models with vocabulary mismatch. InProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026), 2026

work page 2026

[43] [43]

Overcoming vocabulary mismatch: V ocabulary- agnostic teacher guided language modeling.arXiv preprint, arXiv:2503.19123, 2025

Haebin Shin, Lei Ji, Xiao Liu, and Yeyun Gong. Overcoming vocabulary mismatch: V ocabulary- agnostic teacher guided language modeling.arXiv preprint, arXiv:2503.19123, 2025

work page arXiv 2025

[44] [44]

MoL: Mixture of layers in cross- tokenizer embedding model distillation.Knowledge-Based Systems, 2026

Hai An Vu, Minh-Phuc Truong, Tu Vu, and Linh Ngo Van. MoL: Mixture of layers in cross- tokenizer embedding model distillation.Knowledge-Based Systems, 2026

work page 2026

[45] [45]

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V . Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason ...

work page internal anchor Pith review Pith/arXiv arXiv 2016

[46] [46]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Jun-Kun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint, arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint, arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[50] [50]

Mitra, H

Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of SLMs in grade school math.arXiv preprint, arXiv:2402.14830, 2024

work page arXiv 2024

[51] [51]

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Git- man. OpenMathInstruct-1: A 1.8 million math instruction tuning dataset.arXiv preprint, arXiv:2402.10176, 2024

work page Pith review arXiv 2024

[52] [52]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InAdvances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021

work page 2021

[53] [53]

OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs.arXiv preprint arXiv:2504.04030, 2025

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs.arXiv preprint, arXiv:2504.04030, 2025

work page arXiv 2025

[54] [54]

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A di- verse, challenging, and verifiable synthetic dataset for coding.arXiv preprint, arXiv:2503.02951, 2025

work page Pith review arXiv 2025

[55] [55]

Taco: Topics in algorithmic code generation dataset

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. TACO: Topics in algorithmic COde generation dataset.arXiv preprint, arXiv:2312.14852, 2023

work page arXiv 2023

[56] [56]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page 2022

[57] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024), 2024

[58] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint, arXiv:2108.07732, 2021

[59] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), 2025

[60] Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, and Jinan Xu. KDFlow: A user-friendly and efficient knowledge distillation framework for large language models. arXiv preprint, arXiv:2603.01875, 2026

[61] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[62] Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, and Pengfei Liu. AgencyBench: Benchmarking the frontiers of autonomous agents in 1M-token real-world contexts. arXiv preprint, arXiv:2601.11044, 2026

[63] Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, and Pengfei Liu. Limi: Less is more for agency. arXiv preprint, arXiv:2509.17567, 2025

[64] Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, and Pengfei Liu. Innovatorbench: Evaluating agents’ ability to conduct innovative LLM research. arXiv preprint, arXiv:2510.27598, 2025

[65] Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, and Pengfei Liu. davinci-dev: Agent-native mid-training for software engineering. arXiv preprint, arXiv:2601.18418, 2026

[66] Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, and Kaipeng Zhang. Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces. arXiv preprint, arXiv:2602.14337, 2026

[67] Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, and Pengfei Liu. davinci-env: Open swe environment synthesis at scale. arXiv preprint, arXiv:2603.13023, 2026

[68] Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Xiangkun Hu, Yang Xiao, and Pengfei Liu. Argo: Asynchronous rollout with human guidance for research agent optimization. OpenReview preprint, 2026

[69] Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, and Junfeng Fang. SOD: Step-wise on-policy distillation for small language model agents. arXiv preprint, arXiv:2605.07725, 2026

[70] Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint, arXiv:2604.02288, 2026

[71] Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5399–5416, 2025

[72] Jie Sun, Tianyu Zhang, Houcheng Jiang, Kexin Huang, Xiang Shu, Zhibo Zhu, Lintao Ma, Xingyu Lu, Jun Zhou, Junkang Wu, Chi Luo, An Zhang, Jiancan Wu, and Xiang Wang. Lamp-val: Large language models empower personalized valuation in auction. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 579–595, 2025

[73] Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, and Xiang Wang. SepSeq: A training-free framework for long numerical sequence processing in LLMs. arXiv preprint, arXiv:2604.07737, 2026

[74] Yongduo Sui, Shuyao Wang, Jie Sun, Zhiyuan Liu, Qing Cui, Longfei Li, Jun Zhou, Xiang Wang, and Xiangnan He. A simple data augmentation for graph classification: A perspective of equivariance and invariance. ACM Transactions on Knowledge Discovery from Data, 19(2): 1–24, 2025

[75] Yongduo Sui, Jie Sun, Shuyao Wang, Zemin Liu, Qing Cui, Longfei Li, and Xiang Wang. A unified invariant learning framework for graph classification. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025

A Training Data Construction and Curation

This appendix describes the construction of the cold-start SFT corpus...

= 10000. Since 10000 = 7×1428 + 4, the remainder is 4. Answer: 4 ✓ Correct

SimpleOPD: Correctly identifies 100 terms and computes the sum as 10000. However, makes an arithmetic error in the final modular division step, computing 10000 ÷ 7 = 1428 remainder 3 instead of 4. Answer: 3 ✗ Incorrect

ALM: Incorrectly counts the number of terms as 199 (confusing the last ter...