Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

Cong-Thanh Do; Kate Knill; Stella Eva Tsiapali

arxiv: 2603.22056 · v2 · pith:ILYMW5VHnew · submitted 2026-03-23 · 💻 cs.CL

Pith reviewed 2026-05-21 10:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords knowledge distillation · large language models · vocabulary mismatch · generative adversarial learning · key-query matching · dual-space distillation · ROUGE-L · text generation

The pith

Generative adversarial learning aligns mismatched key and query distributions to improve cross-tokenizer knowledge distillation for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the attention mechanisms inside dual-space knowledge distillation for large language models that use different tokenizers. It identifies mismatched distributions between keys and queries computed from the distinct models as a core limitation. To correct this, the authors add generative adversarial learning that aligns those distributions without altering the overall distillation setup. Experiments then show modest but consistent ROUGE-L gains in generated text, with larger benefits on data outside the training distribution. Readers concerned with deploying smaller models would care because the approach narrows the remaining performance difference with same-tokenizer distillation.

Core claim

The paper claims that adding generative adversarial learning to DSKD-CMA produces DSKD-CMA-GA, which aligns the mismatched key and query distributions from models with distinct tokenizers. This change delivers modest but consistent ROUGE-L gains in text generation quality, especially an average +0.37 improvement on out-of-distribution data, and thereby reduces the performance gap relative to same-tokenizer knowledge distillation.

What carries the argument

DSKD-CMA-GA, the dual-space knowledge distillation method that uses generative adversarial learning to align mismatched key and query distributions computed from distinct models.
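
As a rough illustration of the adversarial term, the sketch below evaluates the empirical value of the key-query objective quoted from the paper further down this page, E_q[log D(q)] + E_k[log(1 − D(k))], for a fixed discriminator. The logistic discriminator, the function names, and the toy vectors are assumptions made for illustration; the paper's actual discriminator architecture and training loop are not described on this page.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def discriminator(vec, weights, bias):
    """Hypothetical logistic discriminator D(x); not the paper's architecture."""
    score = sum(w * v for w, v in zip(weights, vec)) + bias
    return sigmoid(score)


def kq_objective(queries, keys, weights, bias, eps=1e-12):
    """Empirical E_q[log D(q)] + E_k[log(1 - D(k))] over two sets of vectors."""
    q_term = sum(
        math.log(discriminator(q, weights, bias) + eps) for q in queries
    ) / len(queries)
    k_term = sum(
        math.log(1.0 - discriminator(k, weights, bias) + eps) for k in keys
    ) / len(keys)
    return q_term + k_term


# Toy vectors (made up for illustration): queries and keys in separate regions.
queries = [[2.0, 0.0], [2.5, 0.5]]
keys = [[-2.0, 0.0], [-2.5, -0.5]]

blind = kq_objective(queries, keys, weights=[0.0, 0.0], bias=0.0)
sharp = kq_objective(queries, keys, weights=[3.0, 0.0], bias=0.0)
```

A discriminator that outputs 0.5 everywhere yields −2 log 2 (about −1.386), the value reached when it cannot tell the two sets apart. A discriminator that separates queries from keys pushes the value towards 0, and in the min-max formulation the query side would be trained to pull it back down.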

If this is right

Text generation quality improves modestly but consistently across evaluation tasks.
Gains are larger on out-of-distribution data, averaging +0.37 ROUGE-L.
The performance difference between cross-tokenizer and same-tokenizer distillation shrinks.
Smaller student models can more effectively mimic larger teachers despite vocabulary mismatches.
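
Since every claimed gain above is stated in ROUGE-L, a minimal sketch of the metric may help. It computes a sentence-level ROUGE-L F1 score from the longest common subsequence (LCS) of two token sequences. Whitespace tokenization, the balanced F1 weighting, and the function names are assumptions; the exact scoring configuration used in the paper is not given on this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 with whitespace tokenization (an assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Here the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so the score is 5/6. Reported gains such as +0.37 are differences between such scores averaged over a test set, presumably on a 0 to 100 scale.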

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The adversarial alignment technique may extend to other distillation settings that suffer from representation mismatches.
This approach could reduce reliance on tokenizer alignment steps during model compression pipelines.
Scaling experiments on larger model families would test whether the key-query matching benefit holds at greater sizes.
The method shares structure with adversarial domain adaptation used in other NLP transfer tasks.

Load-bearing premise

Generative adversarial learning will align the mismatched key and query distributions from distinct models without introducing instability or degrading the other distillation objectives.

What would settle it

A controlled comparison of DSKD-CMA with and without the generative adversarial component on out-of-distribution test sets that shows no ROUGE-L improvement or added instability would falsify the benefit of the alignment step.

Original abstract

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a GAN term to DSKD-CMA to align mismatched key-query distributions in cross-tokenizer distillation and reports modest ROUGE-L gains on OOD data, but provides no direct checks that alignment occurred or that the new term drives the improvement.

This paper adds a generative adversarial learning component to DSKD-CMA to fix mismatched key and query distributions from different models in cross-tokenizer knowledge distillation. The gains are modest, with a +0.37 average ROUGE-L improvement on out-of-distribution data, but there's no direct check that the adversarial part is doing the alignment work.

They start with a useful analysis of the existing DSKD-CMA method. Using manual token alignment probing and heatmap visualizations, they show both where the attention mechanism works well and where it falls short for vocabulary mismatch. That diagnostic step is a solid foundation for the extension.

The new DSKD-CMA-GA method applies GA learning to align the distributions. Experiments indicate it helps narrow the performance gap between cross- and same-tokenizer KD, particularly on OOD cases.

The soft spot is that the paper does not provide quantitative evidence of successful alignment, like measuring distribution similarity before and after the GA objective. There are also no ablations isolating the contribution of the adversarial term, and no discussion of training stability or multiple runs to rule out noise. Since GANs can be unstable, it's hard to tell if the small gain is reliable or just from extra optimization.

This work is aimed at researchers dealing with LLM deployment and distillation across different tokenizers. It would be of interest to those looking for practical ways to improve KD when vocabularies don't match. The paper shows clear thinking about the problem and builds directly on prior work, so it is worth sending to peer review for a full evaluation of the experiments and claims.

Referee Report

3 major / 2 minor

Summary. The paper analyzes the attention mechanism in Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) for LLMs with vocabulary mismatch, identifying limitations via token alignment probing and heatmaps. It proposes DSKD-CMA-GA, which augments the method with a generative adversarial (GA) objective to align mismatched key and query distributions computed from distinct teacher and student models. Experiments report modest but consistent ROUGE-L gains in text generation, averaging +0.37 on out-of-distribution data and narrowing the gap to same-tokenizer KD.

Significance. If the central claim holds, the work offers an incremental but practical advance for knowledge distillation across tokenizer boundaries, a frequent real-world constraint. The analysis of DSKD-CMA provides useful diagnostic insight, and the adversarial extension is a reasonable direction. However, the modest size of the reported gains limits the potential impact unless stronger evidence links them specifically to distribution alignment.

major comments (3)

[§3] Method (DSKD-CMA-GA description): No quantitative verification is provided that the generative adversarial term actually aligns the key and query distributions (e.g., no pre/post MMD, Wasserstein, or cosine-distance measurements between the two distributions). Without this, it is impossible to confirm that the reported ROUGE-L gains arise from successful matching rather than incidental effects of additional training.
[§4] Experiments: The manuscript lacks an ablation that isolates the contribution of the GA objective from the base DSKD-CMA components. This is load-bearing for the claim that the adversarial term is responsible for narrowing the cross- vs. same-tokenizer gap.
[§4] Experiments: No loss curves, multiple random seeds, or stability diagnostics are reported for the combined distillation + adversarial objective, despite well-known instability risks of GAN-style training. This leaves open the possibility that the +0.37 OOD gain is within noise or an artifact of extra optimization steps.

minor comments (2)

[Abstract] The abstract and method sections use “GA” for generative adversarial without spelling out the acronym on first use.
[Figures] Figure captions for attention heatmaps could more explicitly state the tokenization difference being visualized to aid readers unfamiliar with the vocabulary mismatch setting.
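
The first major comment asks for a distribution-distance measurement such as MMD before and after the adversarial term is applied. A minimal sketch of one such check follows: a biased (V-statistic) estimate of squared maximum mean discrepancy with an RBF kernel. The kernel choice, the bandwidth, the function names, and the toy vectors are all assumptions; nothing here comes from the paper's code.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""

    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for query and key vectors (made up for illustration).
queries = [[0.0, 0.0], [1.0, 0.0]]
keys_far = [[5.0, 5.0], [6.0, 5.0]]

same = mmd_squared(queries, queries)
far = mmd_squared(queries, keys_far)
```

Identical samples give an estimate of zero and well-separated samples give a clearly positive one. Evidence of alignment would be this quantity shrinking between the query and key samples once the adversarial term is added, compared with plain DSKD-CMA.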

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our paper. We address each of the major comments point by point below, and we have revised the manuscript accordingly to incorporate additional analyses and experiments where feasible.

Referee: [§3] Method (DSKD-CMA-GA description): No quantitative verification is provided that the generative adversarial term actually aligns the key and query distributions (e.g., no pre/post MMD, Wasserstein, or cosine-distance measurements between the two distributions). Without this, it is impossible to confirm that the reported ROUGE-L gains arise from successful matching rather than incidental effects of additional training.

Authors: We agree that providing quantitative evidence of the alignment achieved by the generative adversarial objective would better support our claims. In the revised manuscript, we will add pre- and post-training measurements of distribution distances, including MMD and average cosine similarity between the key and query vectors from the teacher and student models. These will be presented in Section 3 to demonstrate the effectiveness of the GA term in reducing the mismatch. revision: yes

Referee: [§4] Experiments: The manuscript lacks an ablation that isolates the contribution of the GA objective from the base DSKD-CMA components. This is load-bearing for the claim that the adversarial term is responsible for narrowing the cross- vs. same-tokenizer gap.

Authors: We appreciate this point and acknowledge the value of a dedicated ablation study. We will include an ablation analysis in the revised Section 4, where we compare the performance of DSKD-CMA with and without the GA objective across the evaluated datasets. This will explicitly isolate the contribution of the adversarial component to the observed ROUGE-L improvements and the narrowing of the performance gap. revision: yes

Referee: [§4] Experiments: No loss curves, multiple random seeds, or stability diagnostics are reported for the combined distillation + adversarial objective, despite well-known instability risks of GAN-style training. This leaves open the possibility that the +0.37 OOD gain is within noise or an artifact of extra optimization steps.

Authors: We recognize the potential concerns regarding the stability of the adversarial training. To address this, we will report loss curves for the distillation and adversarial losses in the revised manuscript. Additionally, we will conduct experiments with multiple random seeds and provide mean and standard deviation for the ROUGE-L scores to demonstrate the robustness of the results. This will help rule out the possibility that the gains are due to noise or optimization artifacts. revision: yes
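
The multi-seed reporting promised in the last response could be summarized along the following lines. This is only a sketch: the score lists are placeholder numbers invented for illustration, not results from the paper, and the use of Welch's t statistic as the noise check is an assumption rather than something the authors commit to.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(baseline, variant):
    """Welch's t statistic for the difference in means (variant - baseline)."""
    mean_b, std_b = summarize(baseline)
    mean_v, std_v = summarize(variant)
    std_err = math.sqrt(std_b ** 2 / len(baseline) + std_v ** 2 / len(variant))
    return (mean_v - mean_b) / std_err


# Placeholder per-seed ROUGE-L scores, NOT taken from the paper.
baseline_scores = [10.0, 10.2, 9.8]   # hypothetical DSKD-CMA runs
variant_scores = [10.4, 10.6, 10.2]   # hypothetical DSKD-CMA-GA runs

t_value = welch_t(baseline_scores, variant_scores)
```

A gain of a few tenths of a ROUGE-L point is only informative relative to the seed-to-seed spread, which is why the mean and standard deviation per method matter more here than any single run.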

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no self-referential derivations

The paper presents DSKD-CMA-GA as an empirical extension of prior DSKD-CMA, adding a generative adversarial term to align mismatched key/query distributions from distinct models. No equations, derivations, or first-principles claims appear that reduce performance gains (e.g., the reported +0.37 ROUGE-L) to a fitted quantity defined by the method itself or to a self-citation chain. The central result is an experimental observation on text generation quality, not a mathematical prediction forced by construction. External benchmarks and ablations would be needed to validate the alignment claim, but this is a correctness issue rather than circularity. The derivation chain is self-contained as a practical proposal without load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level description of the GAN component.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: we propose to integrate key-query matching into DSKD-CMA... Generative Adversarial Alignment (GA)... $\mathcal{L}_{KQ} = \min_{P_Q} \max_{D} \left( \mathbb{E}_{q \sim p_Q}[\log D(q)] + \mathbb{E}_{k \sim p_K}[\log(1 - D(k))] \right)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: DSKD-CMA-GA achieves... ROUGE-L gains of 0.15–1.04 points... narrows the gap between same- and cross-tokenizer performance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

[1] INTRODUCTION. The rise of large language models (LLMs) has driven major advances in text generation and reasoning, yet their scale makes deployment costly in computation, latency, and energy [1]. Knowledge Distillation (KD) mitigates this by transferring capabilities from large Teacher models to smaller Student models, preserving performance while impr...

[2] BACKGROUND AND RELATED WORK. 2.1. Knowledge Distillation (KD). KD transfers knowledge from a large Teacher model to a smaller Student model, improving efficiency with minimal performance loss. Instead of relearning from data, the Student mimics the Teacher’s behaviour to capture its emergent abilities [1]. In its simplest form, black-box KD, the Student is...

[3] METHODS. 3.1. Original DSKD-CMA Method. To demonstrate how our methods fit into the existing framework, we provide a brief overview of DSKD-CMA [3]. A Cross-Model Attention (CMA) mechanism is employed to align Teacher and Student hidden states of distinct dimensions. The Student embeddings are projected into the Teacher space to form queries, Q, while the Teac...

[4] EXPERIMENTAL SETUP. Datasets. Following [3], we use the DataBricks Dolly 15K dataset

[5] for distillation, with a 11K-1K train-validation split. For evaluation, we test in-distribution on Dolly (500 samples) and out-of-distribution on Self-Instruct (SelfInst, 242 samples) [19], Vicuna-Eval (Vicuna, 80 samples) [20], Super-Natural Instructions (S-NI, 1,649 samples) [21], and Unnatural Instructions (UnNI, 23,916 samples) [22], totalling 26,...

[6] RESULTS AND DISCUSSION. Table 1 summarises results for the Student and Teacher models, KD baselines and all DSKD variants tested. The initial Student achieves just 20-50% of the Teachers’ performance, highlighting the gap that KD has to close. 5.1. Chunk-Based Probing Insights. To better understand the role of CMA, we compared it against chunk-based alterna...

[7] CONCLUSION. This paper has presented a methodical analysis and extension of DSKD-CMA [3], a SOTA method in cross-tokenizer KD. Through chunk-level alignment experiments, we confirmed that CMA implicitly captures the expected chunk structure of token sequences, while also revealing weaknesses in the localisation of its mappings. Based on this insight, we ...

[8] ACKNOWLEDGMENTS. No funding was received for conducting this study. The authors have no relevant financial or non-financial interests to disclose.

[9] COMPLIANCE WITH ETHICAL STANDARDS. This study is on the training and evaluation of machine learning models, so no ethical approval was required.

[10] Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, et al., “Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application,” ACM Transactions on Intelligent Systems and Technology, 2024.

[11] Benjamin Minixhofer, Ivan Vulić, and Edoardo Maria Ponti, “Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching,” arXiv preprint arXiv:2503.20083, 2025.

[12] Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu, “Dual-Space Knowledge Distillation for Large Language Models,” in The 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 18164–18181.

[13] Yoon Kim and Alexander M. Rush, “Sequence-Level Knowledge Distillation,” in The 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1317–1327.

[14] Tianxun Zhou and Keng-Hwee Chiam, “Synthetic Data Generation Method for Data-Free Knowledge Distillation in Regression Neural Networks,” Expert Systems with Applications, vol. 227, no. C, 2023.

[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015.

[16] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter,” arXiv preprint arXiv:1910.01108, 2019.

[17] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, et al., “TinyBERT: Distilling BERT for Natural Language Understanding,” in Findings of the ACL, 2020, pp. 4163–4174.

[18] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou, “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers,” in The 34th International Conference on Neural Information Processing Systems, 2020.

[19] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, et al., “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes,” in The 12th International Conference on Learning Representations, 2024.

[20] Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot, “Specializing Smaller Language Models towards Multi-Step Reasoning,” in The 40th International Conference on Machine Learning, 2023, vol. 202 of Proceedings of Machine Learning Research, pp. 10421–10430.

[21] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi, “Knowledge Fusion of Large Language Models,” in The 12th International Conference on Learning Representations, 2024.

[22] Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou, “Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping,” in Findings of the ACL, 2025, pp. 8005–8018.

[23] Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo, “Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs,” Transactions on Machine Learning Research, 2025.

[24] Anh Duc Le, Tu Vu, Nam Le Hai, Nguyen Thi Ngoc Diep, Linh Ngo Van, Trung Le, et al., “CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers,” arXiv preprint arXiv:2502.16806, 2025.

[25] Shujian Zhang, Xinjie Fan, Huangjie Zheng, Korawat Tanwisuth, and Mingyuan Zhou, “Alignment Attention by Matching Key and Query Distributions,” Advances in Neural Information Processing Systems, vol. 34, pp. 13444–13457, 2021.

[26] Cédric Villani, Optimal Transport: Old and New, vol. 338 of Grundlehren der mathematischen Wissenschaften, Springer, 2008.

[27] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, et al., “Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM,” 2023.

[28] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, et al., “Self-Instruct: Aligning Language Models with Self-Generated Instructions,” in The 61st Annual Meeting of the ACL, 2023, pp. 13484–13508.

[29] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, et al., “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality,” 2023.

[30] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, et al., “Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks,” ArXiv, vol. abs/2204.07705, 2022.

[31] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick, “Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor,” in The 61st Annual Meeting of the ACL, 2023, pp. 14409–14428.

[32] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang, “MiniLLM: Knowledge Distillation of Large Language Models,” in The 12th International Conference on Learning Representations, 2024.

[33] S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.

[34] Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun, “DistiLLM: Towards Streamlined Distillation for Large Language Models,” in The 41st International Conference on Machine Learning, 2024.

[35] Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong, “Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models,” in The 31st International Conference on Computational Linguistics, 2025, pp. 5737–5755.

[36] M.L. Menéndez, J.A. Pardo, L. Pardo, and M.C. Pardo, “The Jensen-Shannon Divergence,” Journal of the Franklin Institute, vol. 334, no. 2, pp. 307–318, 1997.

[37] Chin-Yew Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.

[38] Cong Thanh Do, Rama Sanand Doddipatla, and Kate Knill, “Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models,” in The 18th International Natural Language Generation Conference, 2025, pp. 833–845.

[1] [1]

Knowl- edge Distillation (KD) mitigates this by transferring capabilities from large Teacher models to smaller Student models, preserving performance while improving efficiency

INTRODUCTION The rise of large language models (LLMs) has driven major ad- vances in text generation and reasoning, yet their scale makes deployment costly in computation, latency, and energy [1]. Knowl- edge Distillation (KD) mitigates this by transferring capabilities from large Teacher models to smaller Student models, preserving performance while impr...

work page

[2] [2]

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

BACKGROUND AND RELA TED WORK 2.1. Knowledge Distillation (KD) KD transfers knowledge from a large Teacher model to a smaller Student model, improving efficiency with minimal performance loss. Instead of relearning from data, the Student mimics the Teacher’s behaviour to capture its emergent abilities [1]. In its simplest form, black-box KD, the Student is...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Original DSKD-CMA Method To demonstrate how our methods fit into the existing framework, we provide a brief overview of DSKD-CMA [3]

METHODS 3.1. Original DSKD-CMA Method To demonstrate how our methods fit into the existing framework, we provide a brief overview of DSKD-CMA [3]. ACross-Model Attention(CMA) mechanism is employed to align Teacher and Student hidden states of distinct dimensions. The Student embeddings are projected into the Teacher space to form queries,Q, while the Teac...

work page

[4] [4]

EXPERIMENTAL SETUP DatasetsFollowing [3], we use the DataBricks Dolly 15K dataset

work page

[5] [5]

for distillation, with a 11K-1K train-validation split. For eval- uation, we test in-distribution onDolly(500 samples) and out-of- distribution on Self-Instruct (SelfInst, 242 samples) [19], Vicuna- Eval (Vicuna, 80 samples) [20], Super-Natural Instructions (S-NI, 1,649 samples) [21], and Unnatural Instructions (UnNI, 23,916 sam- ples) [22], totalling 26,...

work page

[6] [6]

The initial Student achieves just 20-50% of the Teachers’ performance, highlighting the gap that KD has to close

RESULTS AND DISCUSSION Table 1 summarises results for the Student and Teacher models, KD baselines and all DSKD variants tested. The initial Student achieves just 20-50% of the Teachers’ performance, highlighting the gap that KD has to close. 5.1. Chunk-Based Probing Insights To better understand the role of CMA, we compared it against chunk-based alterna...

work page arXiv

[7] [7]

CONCLUSION This paper has presented a methodical analysis and extension of DSKD-CMA [3], a SOTA method in cross-tokenizer KD. Through chunk-level alignment experiments, we confirmed that CMA implic- itly captures the expected chunk structure of token sequences, while also revealing weaknesses in the localisation of its mappings. Based on this insight, we ...

work page

[8] [8]

The authors have no relevant financial or non-financial interests to disclose

ACKNOWLEDGMENTS No funding was received for conducting this study. The authors have no relevant financial or non-financial interests to disclose

work page

[9] [9]

COMPLIANCE WITH ETHICAL STANDARDS This study is on the training and evaluation of machine learning models, so no ethical approval was required

work page

[10] [10]

Survey on Knowledge Distil- lation for Large Language Models: Methods, Evaluation, and Application,

Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, et al., “Survey on Knowledge Distil- lation for Large Language Models: Methods, Evaluation, and Application,”ACM Transactions on Intelligent Systems and Technology, 2024

work page 2024

[11] [11]

Universal cross-tokenizer distillation via approximate likelihood matching.arXiv preprint arXiv:2503.20083, 2025

Benjamin Minixhofer, Ivan Vuli ´c, and Edoardo Maria Ponti, “Universal Cross-Tokenizer Distillation via Approximate Li- kelihood Matching,”arXiv preprint arXiv:2503.20083, 2025

work page arXiv 2025

[12] [12]

Dual-Space Knowledge Distillation for Large Lan- guage Models,

Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu, “Dual-Space Knowledge Distillation for Large Lan- guage Models,” inThe 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 18164–18181

work page 2024

[13] [13]

Sequence-Level Knowl- edge Distillation,

Yoon Kim and Alexander M. Rush, “Sequence-Level Knowl- edge Distillation,” inThe 2016 Conference on Empirical Meth- ods in Natural Language Processing, 2016, pp. 1317–1327

work page 2016

[14] Tianxun Zhou and Keng-Hwee Chiam, “Synthetic Data Generation Method for Data-Free Knowledge Distillation in Regression Neural Networks,” Expert Systems with Applications, vol. 227, no. C, 2023.

[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015.

[16] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter,” arXiv preprint arXiv:1910.01108, 2019.

[17] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, et al., “TinyBERT: Distilling BERT for Natural Language Understanding,” in Findings of the ACL, 2020, pp. 4163–4174.

[18] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou, “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers,” in The 34th International Conference on Neural Information Processing Systems, 2020.

[19] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, et al., “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes,” in The 12th International Conference on Learning Representations, 2024.

[20] Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot, “Specializing Smaller Language Models towards Multi-Step Reasoning,” in The 40th International Conference on Machine Learning, 2023, vol. 202 of Proceedings of Machine Learning Research, pp. 10421–10430.

[21] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi, “Knowledge Fusion of Large Language Models,” in The 12th International Conference on Learning Representations, 2024.

[22] Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou, “Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping,” in Findings of the ACL, 2025, pp. 8005–8018.

[23] Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo, “Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs,” Transactions on Machine Learning Research, 2025.

[24] Anh Duc Le, Tu Vu, Nam Le Hai, Nguyen Thi Ngoc Diep, Linh Ngo Van, Trung Le, et al., “CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers,” arXiv preprint arXiv:2502.16806, 2025.

[25] Shujian Zhang, Xinjie Fan, Huangjie Zheng, Korawat Tanwisuth, and Mingyuan Zhou, “Alignment Attention by Matching Key and Query Distributions,” Advances in Neural Information Processing Systems, vol. 34, pp. 13444–13457, 2021.

[26] Cédric Villani, Optimal Transport: Old and New, vol. 338 of Grundlehren der mathematischen Wissenschaften, Springer, 2008.

[27] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, et al., “Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM,” 2023.

[28] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, et al., “Self-Instruct: Aligning Language Models with Self-Generated Instructions,” in The 61st Annual Meeting of the ACL, 2023, pp. 13484–13508.

[29] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, et al., “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality,” 2023.

[30] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, et al., “Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks,” ArXiv, vol. abs/2204.07705, 2022.

[31] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick, “Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor,” in The 61st Annual Meeting of the ACL, 2023, pp. 14409–14428.

[32] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang, “MiniLLM: Knowledge Distillation of Large Language Models,” in The 12th International Conference on Learning Representations, 2024.

[33] S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.

[34] Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun, “DistiLLM: Towards Streamlined Distillation for Large Language Models,” in The 41st International Conference on Machine Learning, 2024.

[35] Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong, “Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models,” in The 31st International Conference on Computational Linguistics, 2025, pp. 5737–5755.

[36] M.L. Menéndez, J.A. Pardo, L. Pardo, and M.C. Pardo, “The Jensen-Shannon Divergence,” Journal of the Franklin Institute, vol. 334, no. 2, pp. 307–318, 1997.

[37] Chin-Yew Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.

[38] Cong Thanh Do, Rama Sanand Doddipatla, and Kate Knill, “Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models,” in The 18th International Natural Language Generation Conference, 2025, pp. 833–845.