DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

Alham Fikri Aji; Jian Gang Ngui; Patomporn Payoungkhamdee; Peerat Limkonchotiwat; Sarana Nutanong; Tinnakit Udsa

arxiv: 2606.04694 · v2 · pith:RSSB5XUEnew · submitted 2026-06-03 · 💻 cs.CL

Patomporn Payoungkhamdee, Tinnakit Udsa, Jian Gang Ngui, Sarana Nutanong, Alham Fikri Aji, Peerat Limkonchotiwat

Pith reviewed 2026-06-28 06:30 UTC · model grok-4.3

keywords: multilingual distillation · small language models · cross-lingual verbalizer · knowledge distillation · Southeast Asian languages · SEA-HELM · sequence-level optimization · token-level supervision

The pith

DuDi combines sequence-level and token-level signals plus a cross-lingual verbalizer to improve distillation of multilingual capabilities into small language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DuDi, a distillation method that pairs an online sequence-level optimization signal with both off-policy and on-policy token-level signals. It adds a cross-lingual verbalizer that rewrites teacher outputs to make feedback more transferable across languages. Experiments across model families and scales on SEA-HELM show consistent gains over standard distillation baselines, with ablations attributing the gains to the complementarity of the three signals.

Core claim

DuDi is a dual-signal multilingual distillation framework that integrates online sequence-level supervision with off-policy and on-policy token-level supervision and applies a cross-lingual verbalizer to refine teacher feedback, thereby improving teacher-student transferability for sub-billion-parameter models on Southeast Asian languages.
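As a rough illustration of how the three training signals might be combined, the sketch below adds one sequence-level term to two token-level terms. This page does not give the paper's objective, so the forward KL divergence, the reward-weighted log-likelihood surrogate, the weights `w_off`, `w_on`, `w_seq`, and every function name are assumptions made for illustration only. The one convention relied on is the usual meaning of the terms: off-policy token-level supervision scores teacher-generated sequences, and on-policy supervision scores student-generated ones.

```python
import math
from typing import Sequence

Dist = Sequence[float]  # one probability distribution over the vocabulary


def token_kl(teacher: Sequence[Dist], student: Sequence[Dist], eps: float = 1e-12) -> float:
    """Mean per-token forward KL(teacher || student) over one sequence.

    Assumption: forward KL is used here only as a familiar token-level
    distillation loss; the paper may use a different divergence.
    """
    total = 0.0
    for p_t, p_s in zip(teacher, student):
        total += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for t, s in zip(p_t, p_s)
            if t > 0.0
        )
    return total / max(len(teacher), 1)


def sequence_loss(student_seq_logprob: float, reward: float) -> float:
    """Hypothetical sequence-level surrogate: -reward * log p_student(sequence)."""
    return -reward * student_seq_logprob


def dual_signal_loss(
    off_policy_kl: float,  # token KL measured on teacher-generated text
    on_policy_kl: float,   # token KL measured on student-generated text
    seq_loss: float,       # sequence-level term
    w_off: float = 1.0,
    w_on: float = 1.0,
    w_seq: float = 1.0,
) -> float:
    """Weighted sum of the three terms; the weights are illustrative."""
    return w_off * off_policy_kl + w_on * on_policy_kl + w_seq * seq_loss
```

In a real training loop the two KL terms would be computed on different batches (teacher samples versus student samples), and the reward passed to `sequence_loss` would come from whatever sequence-level scorer the method uses.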

What carries the argument

DuDi dual-signal distillation framework: the mechanism that merges sequence-level and token-level signals while routing teacher feedback through a cross-lingual verbalizer to produce more transferable training targets.

If this is right

Multilingual small language models can retain higher SEA-language accuracy after distillation when both sequence-level and token-level signals are used together.
Cross-lingual verbalization reduces the mismatch between teacher outputs and student language distributions.
Ablation results indicate that removing any one of the three signals degrades performance relative to the full DuDi combination.
The method scales across different model families and teacher-student size ratios.
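To make the idea of a cross-lingual verbalizer more concrete, here is a minimal sketch of a prompt builder in which the teacher receives extra, language-aware context while the student sees only the original question. The template wording, the `PromptPair` type, and the `build_prompts` function are all hypothetical; the paper's actual templates appear in its Figure 6 and are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    teacher: str  # prompt with verbalizer context, shown to the teacher only
    student: str  # plain prompt, shown to the student


# Hypothetical wording, NOT the paper's template.
TEACHER_TEMPLATE = (
    "Question ({lang}): {question}\n"
    "Reference answer: {reference}\n"
    "Explain the answer in {lang} so that a {lang}-speaking student can follow it."
)


def build_prompts(question: str, reference: str, lang: str) -> PromptPair:
    """Build the teacher and student prompts for one training example."""
    teacher = TEACHER_TEMPLATE.format(lang=lang, question=question, reference=reference)
    # Assumption: the student is prompted with the bare question.
    return PromptPair(teacher=teacher, student=question)
```

The design point being illustrated is only the asymmetry: the verbalizer shapes what the teacher produces, so the feedback the student receives is already expressed in the student's target language.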

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-signal pattern could be tested on other low-resource language families beyond Southeast Asia.
If the verbalizer is language-pair specific, its construction cost may limit application to very large numbers of languages.
Token-level signals might be replaced by cheaper synthetic data sources while preserving most of the reported gain.
Sequence-level optimization may interact with reinforcement-learning-style objectives that are already common in post-training.

Load-bearing premise

That sequence-level optimization, token-level supervision, and cross-lingual verbalization supply complementary and transferable learning signals for multilingual small language models.

What would settle it

Run the same teacher-student pairs on SEA-HELM with and without the cross-lingual verbalizer component; if the version lacking the verbalizer matches or exceeds DuDi performance, the claim that the three signals are complementary collapses.
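The check described above can be sketched as a small helper that compares each ablated variant against the full configuration. The function names, the `"full"` key, and the `margin` parameter are assumptions for illustration; no scores from the paper are used or implied.

```python
from typing import Dict


def ablation_drops(
    scores: Dict[str, float], full_key: str = "full", margin: float = 0.0
) -> Dict[str, bool]:
    """Map each ablated variant to True if it scores below the full
    configuration by more than `margin`."""
    full = scores[full_key]
    return {
        name: (full - score) > margin
        for name, score in scores.items()
        if name != full_key
    }


def complementarity_holds(
    scores: Dict[str, float], full_key: str = "full", margin: float = 0.0
) -> bool:
    """True only if every ablated variant is worse than the full configuration."""
    return all(ablation_drops(scores, full_key, margin).values())
```

Setting `margin` above zero guards against treating run-to-run noise as a real drop; a variant that matches or exceeds the full configuration makes `complementarity_holds` return `False`.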

Figures

Figures reproduced from arXiv: 2606.04694 by Alham Fikri Aji, Jian Gang Ngui, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Sarana Nutanong, Tinnakit Udsa.

**Figure 2.** Overview of the DuDi framework, which inte…

**Figure 4.** Example of cross-lingual verbalized teacher…

**Figure 5.** Overlap ratio between teacher and student log…

**Figure 6.** Comparison of verbalizer templates for the teacher prompt, including the English verbalizer from Shenfeld et al. (2026), our extended multilingual verbalizer, and the proposed cross-lingual verbalizer with its corresponding student prompt example in Thai.

**Figure 7.** Overlap ratio between student and teacher…

Original abstract

Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DuDi is a practical distillation tweak that combines sequence and token signals with a cross-lingual verbalizer and reports steady gains on SEA-HELM, but the improvements look incremental rather than transformative.

The core takeaway is that this paper gives a workable recipe for better multilingual distillation on small models for Southeast Asian languages. It mixes an online sequence-level signal with off-policy and on-policy token signals, then adds a cross-lingual verbalizer to clean up teacher feedback. The experiments claim this beats standard baselines across several model families and scales.

What stands out as new is the specific three-way combination rather than any single component. The ablations are the strongest part: they test whether the signals are actually complementary and show that removing any one hurts performance. That kind of check is useful and not always done.

The results are presented as consistent outperformance on SEA-HELM, which is a reasonable benchmark for the target region. The setup covers multiple teacher-student pairs and model sizes, so the claims are not resting on a single narrow test.

The soft spots are mostly about scale and detail. The gains are described as consistent but no effect sizes or variance numbers appear in the high-level summary, so it is hard to tell whether the lift justifies the extra machinery. The work stays tightly focused on SEA languages; there is no evidence yet that the verbalizer or signal mix transfers cleanly outside that group. The method also adds several moving parts, and the paper does not discuss whether simpler combinations could capture most of the benefit.

This is the kind of paper that matters for groups building or deploying small multilingual models in resource-limited settings. It is not going to change how people think about distillation in general, but the empirical checks are honest enough that a referee could evaluate the claims directly.

I would send it to peer review. The experiments are scoped and the ablations address the main internal question, so referees can judge whether the reported improvements hold up.

Referee Report

1 major / 0 minor

Summary. The paper introduces DuDi, a dual-signal multilingual distillation framework for small language models that combines an online sequence-level signal with off-policy and on-policy token-level signals, plus a cross-lingual verbalizer to refine teacher feedback. It claims consistent outperformance over competitive distillation baselines on SEA-HELM across model families, scales, and teacher-student settings, with ablations confirming that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable signals.

Significance. If the empirical results hold with proper controls and statistical support, the work could meaningfully advance distillation techniques for improving multilingual performance of sub-billion SLMs on underrepresented SEA languages, where degradation is severe. The multi-signal approach, if shown to be additive, offers a practical direction for low-resource transfer.

major comments (1)

Abstract: the claim that DuDi 'consistently outperforms competitive distillation baselines' and that the three signals 'provide complementary and transferable learning signals' is asserted without any quantitative results, error bars, baseline details, or statistical tests, so the central empirical claim cannot be evaluated.
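One way the requested error bars could be produced is a paired bootstrap over per-task score differences between DuDi and a baseline. The sketch below is a generic illustration, not the paper's protocol: the paper's reference list points to almost-stochastic-order testing, and a simple percentile bootstrap is swapped in here for brevity. The function name, the defaults, and the input layout (one score per task, same task order in both lists) are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    method_scores: Sequence[float],
    baseline_scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for method minus
    baseline, using a percentile bootstrap over paired per-task differences."""
    if len(method_scores) != len(baseline_scores) or not method_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample task indices with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

An interval that excludes zero would support a claim of consistent improvement; an interval straddling zero would suggest the gain is within noise for that comparison.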

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the detailed review. We address the major comment on the abstract below.

Referee: Abstract: the claim that DuDi 'consistently outperforms competitive distillation baselines' and that the three signals 'provide complementary and transferable learning signals' is asserted without any quantitative results, error bars, baseline details, or statistical tests, so the central empirical claim cannot be evaluated.

Authors: Abstracts are intentionally concise high-level summaries and standard practice omits detailed metrics, error bars, and tests (which appear in the full paper). Section 4 presents SEA-HELM results across model families/scales/settings with tables comparing DuDi to baselines; Section 5 contains ablations confirming complementary signals; the experimental protocol and baseline descriptions are in Sections 3.2 and 4.1. We will revise the abstract to incorporate a small number of key quantitative highlights (e.g., average gains) while remaining within length limits.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical distillation framework (DuDi) combining sequence-level optimization, token-level signals, and a cross-lingual verbalizer, with performance claims resting entirely on experiments across SEA-HELM benchmarks, multiple model families, and ablations. No derivation chain, equations, fitted parameters, or first-principles results are presented that could reduce to inputs by construction. The abstract and high-level description contain no self-definitional steps, fitted-input predictions, or load-bearing self-citations; the complementarity conclusion is framed as an empirical finding from ablations rather than a logical necessity. This is a standard empirical methods paper whose central claims are externally falsifiable via replication on the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or derivations; therefore no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

43 extracted references · 15 canonical work pages

[1] von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin.

[2] Reinforcement Learning via Self-Distillation. The 1st Workshop on Scaling Post-training for LLMs.

[3] deep-significance: Easy and Meaningful Significance Testing in the Age of Neural Networks. ML Evaluation Standards Workshop at the Tenth International Conference on Learning Representations.

[4] Rotem Dror and Segev Shlomov and Roi Reichart. Deep Dominance - How to Properly Compare Deep Neural Models. 2019. doi:10.18653/v1/p19-1266

[5] An optimal transportation approach for assessing almost stochastic order. The Mathematics of the Uncertain. 2018.

[6] Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and Guez, Arthur and Hubert, Thomas and Baker, Lucas and Lai, Matthew and Bolton, Adrian and Chen, Yutian and Lillicrap, Timothy and Hui, Fan and Sifre, Laurent and van den Driessche, George and Graepel, Thore and Hassabis, Demis. Nature. doi:10.1038/nature24270

[7] Tesauro, Gerald. Commun. ACM. 1995. doi:10.1145/203330.203343

[8] Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. 2026.

[9] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv preprint arXiv:2604.13016.

[10] Zhang, Yuanchi and Wang, Yile and Liu, Zijun and Wang, Shuo and Wang, Xiaolong and Li, Peng and Sun, Maosong and Liu, Yang. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.603

[11] Libo Qin and Qiguang Chen and Yuhang Zhou and Zhi Chen and Yinghui Li and Lizi Liao and Min Li and Wanxiang Che and Philip S. Yu. A survey of multilingual large language models. 2025. doi:10.1016/j.patter.2024.101118

[12] Chen, Xiao and Ma, Changyi and Fan, Wenqi and Zhang, Zhaoxiang and Qing, Li. C2KD: Cross-layer and Cross-head Knowledge Distillation for Small Language Model-based Recommendation. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.917

[13] Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and Lu, Jinghui and Jiang, Yuang and Li, Huitao and Li, Xin and Yu, Kunyu and Dong, Ruihai and Gu, Shangding and Li, Yuekang and Xie, Xiaofei and Juefei-Xu, Felix and Khomh, Foutse and Yoshie, Osamu and C... MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation. 2025. doi:10.18653/v1/2025.emnlp-main.79

[14] Yang, Zhaorui and Pang, Tianyu and Feng, Haozhe and Wang, Han and Chen, Wei and Zhu, Minfeng and Liu, Qian. Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.58

[15] Pham, Thang M. and Nguyen, Phat T. and Yoon, Seunghyun and Lai, Viet Dac and Dernoncourt, Franck and Bui, Trung. SlimLM: An Efficient Small Language Model for On-Device Document Assistance. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2025. doi:10.18653/v1/2025.acl-demo.42

[16] Shengding Hu and Yuge Tu and Xu Han and Ganqu Cui and Chaoqun He and Weilin Zhao and Xiang Long and Zhi Zheng and Yewei Fang and Yuxiang Huang and Xinrong Zhang and Zhen Leng Thai and Chongyi Wang and Yuan Yao and Chenyang Zhao and Jie Zhou and Jie Cai and Zhongwu Zhai and Ning Ding and Chao Jia and Guoyang Zeng and dahai li and Zhiyuan Liu and Maosong Su... 2024.

[17] Zechun Liu and Changsheng Zhao and Forrest Iandola and Chen Lai and Yuandong Tian and Igor Fedorov and Yunyang Xiong and Ernie Chang and Yangyang Shi and Raghuraman Krishnamoorthi and Liangzhen Lai and Vikas Chandra. Mobile. 2024.

[18] Susanto, Yosephine and Hulagadri, Adithya Venkatadri and Montalan, Jann Railey and Ngui, Jian Gang and Yong, Xianbin and Leong, Wei Qi and Rengarajan, Hamsawardhini and Limkonchotiwat, Peerat and Mai, Yifan and Tjhi, William Chandra. SEA-HELM: Southeast Asian Holistic Evaluation of Language Models. Findings of the Association for Computational Lingui... 2025. doi:10.18653/v1/2025.findings-acl.636

[19] TIP: Token Importance in On-Policy Distillation. 2026.

[20] Jongwoo Ko and Tianyi Chen and Sungnyun Kim and Tianyu Ding and Luming Liang and Ilya Zharkov and Se-Young Yun. Disti. 2025.

[21] Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling. The Thirteenth International Conference on Learning Representations.

[22] Jongwoo Ko and Sungnyun Kim and Tianyi Chen and Se-Young Yun. Disti. 2024.

[23] Yuxian Gu and Li Dong and Furu Wei and Minlie Huang. Mini. 2024.

[24] Lin, Alexander and Wohlwend, Jeremy and Chen, Howard and Lei, Tao. Autoregressive Knowledge Distillation through Imitation Learning. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.494

[25] How do Large Language Models Handle Multilingualism? The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[26] Yoon, Dongkeun and Jang, Joel and Kim, Sungdong and Kim, Seungone and Shafayat, Sheikh and Seo, Minjoon. LangBridge: Multilingual Reasoning Without Multilingual Supervision. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.405

[27] Chen, Nuo and Zheng, Zinan and Wu, Ning and Gong, Ming and Zhang, Dongmei and Li, Jia. Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.411

[28] Payoungkhamdee, Patomporn and Limkonchotiwat, Peerat and Baek, Jinheon and Manakul, Potsawee and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana. An Empirical Study of Multilingual Reasoning Distillation for Question Answering. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.442

[29] Weixiang Zhao and Jiahe Guo and Yang Deng and Tongtong Wu and Wenxuan Zhang and Yulin Hu and Xingyu Sui and Yanyan Zhao and Wanxiang Che and Bing Qin and Tat-Seng Chua and Ting Liu. When Less Language is More: Language-Reasoning Disentanglement Makes. 2026.

[30] Qwen2.5 Technical Report. 2025.

[31] Qwen3 Technical Report. 2025.

[32] Qwen3.5-Omni Technical Report. 2026.

[33] Chen, Zixiang and Deng, Yihe and Yuan, Huizhuo and Ji, Kaixuan and Gu, Quanquan. Proceedings of the 41st International Conference on Machine Learning. 2024.

[34] Hinton, Geoffrey E. and Vinyals, Oriol and Dean, Jeffrey. Distilling the Knowledge in a Neural Network. CoRR.

[35] Kim, Yoon and Rush, Alexander M. Sequence-Level Knowledge Distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1139

[36] On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. The Twelfth International Conference on Learning Representations.

[37] Yongliang Wu and Yizhou Zhou and Zhou Ziheng and Yingzhe Peng and Xinyu Ye and Xinting Hu and Wenbo Zhu and Lu Qi and Ming-Hsuan Yang and Xu Yang. On the Generalization of. 2026.

[38] Self-Distillation Enables Continual Learning. ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving. 2026.

[39] Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation. 2025.

[40] A Survey of Small Language Models. 2024.

[41] Small Language Models (SLMs) Can Still Pack a Punch: A survey. 2025.

[42] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. 2024.

[43] The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes. 2026.
