Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch
Pith reviewed 2026-05-21 10:54 UTC · model grok-4.3
The pith
Generative adversarial learning aligns mismatched key and query distributions to improve cross-tokenizer knowledge distillation for LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that adding generative adversarial learning to DSKD-CMA produces DSKD-CMA-GA, which aligns the mismatched key and query distributions from models with distinct tokenizers. The change yields modest but consistent ROUGE-L gains in text generation quality, most notably an average of +0.37 on out-of-distribution data, and thereby narrows the performance gap to same-tokenizer knowledge distillation.
What carries the argument
DSKD-CMA-GA, the dual-space knowledge distillation method that uses generative adversarial learning to align mismatched key and query distributions computed from distinct models.
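The objective behind this alignment can be sketched numerically. The paper passage quoted on this page gives it as L_KQ = min_{P_Q} max_D (E_{q∼p_Q}[log D(q)] + E_{k∼p_K}[log(1 − D(k))]). The snippet below only evaluates the inner expression on finite samples; the logistic discriminator, its weights, and the toy vectors are hypothetical and chosen for illustration, so this is a sketch under those assumptions rather than the authors' implementation.

```python
import math

def discriminator(vec, weights, bias):
    """Hypothetical logistic discriminator: returns D(vec) in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def kq_objective(queries, keys, weights, bias, eps=1e-12):
    """Sample estimate of E_q[log D(q)] + E_k[log(1 - D(k))].

    The discriminator side tries to maximise this value; the side
    producing the queries tries to minimise it.
    """
    q_term = sum(
        math.log(discriminator(q, weights, bias) + eps) for q in queries
    ) / len(queries)
    k_term = sum(
        math.log(1.0 - discriminator(k, weights, bias) + eps) for k in keys
    ) / len(keys)
    return q_term + k_term

# Toy vectors (assumed, not from the paper): queries and keys sit in
# clearly different regions, i.e. the distributions are mismatched.
queries = [[1.0, 0.5], [0.8, 0.7]]
keys = [[-1.0, -0.5], [-0.9, -0.6]]

# A discriminator that separates the two sets versus one that outputs 0.5.
separating = kq_objective(queries, keys, weights=[2.0, 2.0], bias=0.0)
blind = kq_objective(queries, keys, weights=[0.0, 0.0], bias=0.0)
print(separating, blind)
```

When the two distributions coincide, no discriminator can do better than D = 0.5, so the objective bottoms out at −2 log 2 (about −1.386). A value well above that, as with the separating discriminator here, signals that queries and keys are still distinguishable.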
If this is right
- Text generation quality improves modestly but consistently across evaluation tasks.
- Gains are larger on out-of-distribution data, averaging +0.37 ROUGE-L.
- The performance difference between cross-tokenizer and same-tokenizer distillation shrinks.
- Smaller student models can more effectively mimic larger teachers despite vocabulary mismatches.
Where Pith is reading between the lines
- The adversarial alignment technique may extend to other distillation settings that suffer from representation mismatches.
- This approach could reduce reliance on tokenizer alignment steps during model compression pipelines.
- Scaling experiments on larger model families would test whether the key-query matching benefit holds at greater sizes.
- The method shares structure with adversarial domain adaptation used in other NLP transfer tasks.
Load-bearing premise
Generative adversarial learning will align the mismatched key and query distributions from distinct models without introducing instability or degrading the other distillation objectives.
What would settle it
A controlled comparison of DSKD-CMA with and without the generative adversarial component on out-of-distribution test sets would settle it: if the comparison showed no ROUGE-L improvement, or showed added training instability, the benefit of the alignment step would be falsified.
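One way such a comparison could be summarised is sketched below, assuming each variant is trained with the same set of random seeds and scored with ROUGE-L on one test set. The function name, the "exceeds noise" criterion, and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev

def compare_runs(baseline_scores, ga_scores):
    """Summarise paired per-seed ROUGE-L scores for one test set."""
    if len(baseline_scores) != len(ga_scores) or len(baseline_scores) < 2:
        raise ValueError("need the same number (>= 2) of seeds per variant")
    # Per-seed gain of the GA variant over the baseline.
    deltas = [g - b for b, g in zip(baseline_scores, ga_scores)]
    mean_delta = mean(deltas)
    spread = stdev(deltas)  # sample standard deviation across seeds
    return {
        "mean_delta": mean_delta,
        "std_delta": spread,
        # Crude criterion (an assumption): the gain should exceed seed noise.
        "exceeds_noise": mean_delta > spread,
    }

# Placeholder scores for three seeds; NOT taken from the paper.
baseline = [20.0, 20.5, 19.5]
with_ga = [20.5, 20.75, 20.0]
summary = compare_runs(baseline, with_ga)
print(summary)
```

A proper significance test would be stronger than this threshold, but even the simple summary makes clear whether a gain of the reported size sits inside or outside seed-to-seed variation.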
Original abstract
Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the attention mechanism in Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) for LLMs with vocabulary mismatch, identifying limitations via token alignment probing and heatmaps. It proposes DSKD-CMA-GA, which augments the method with a generative adversarial (GA) objective to align mismatched key and query distributions computed from distinct teacher and student models. Experiments report modest but consistent ROUGE-L gains in text generation, averaging +0.37 on out-of-distribution data and narrowing the gap to same-tokenizer KD.
Significance. If the central claim holds, the work offers an incremental but practical advance for knowledge distillation across tokenizer boundaries, a frequent real-world constraint. The analysis of DSKD-CMA provides useful diagnostic insight, and the adversarial extension is a reasonable direction. However, the modest size of the reported gains limits the potential impact unless stronger evidence links them specifically to distribution alignment.
major comments (3)
- [§3] §3 (Method, DSKD-CMA-GA description): No quantitative verification is provided that the generative adversarial term actually aligns the key and query distributions (e.g., no pre/post MMD, Wasserstein, or cosine-distance measurements between the two distributions). Without this, it is impossible to confirm that the reported ROUGE-L gains arise from successful matching rather than incidental effects of additional training.
- [§4] §4 (Experiments): The manuscript lacks an ablation that isolates the contribution of the GA objective from the base DSKD-CMA components. This is load-bearing for the claim that the adversarial term is responsible for narrowing the cross- vs. same-tokenizer gap.
- [§4] §4 (Experiments): No loss curves, multiple random seeds, or stability diagnostics are reported for the combined distillation + adversarial objective, despite well-known instability risks of GAN-style training. This leaves open the possibility that the +0.37 OOD gain is within noise or an artifact of extra optimization steps.
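The distribution-distance measurement requested in the first major comment could be sketched as follows. This is a minimal, assumed setup: a biased squared-MMD estimate with an RBF kernel over small lists of vectors. The kernel bandwidth `gamma` and the toy vectors are illustrative choices, not values from the paper.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples."""
    k_xx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

# Toy vectors (assumed): queries, plus keys before and after a
# hypothetical alignment step.
queries = [[0.0, 0.0], [1.0, 0.0]]
keys_before = [[5.0, 5.0], [6.0, 5.0]]
keys_after = [[0.1, 0.0], [1.1, 0.0]]

mmd_before = mmd_squared(queries, keys_before)
mmd_after = mmd_squared(queries, keys_after)
print(mmd_before, mmd_after)
```

Reporting such a value before and after training with the adversarial term would show directly whether the key and query distributions moved closer together.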
minor comments (2)
- [Abstract] The abstract and method sections use “GA” for generative adversarial without spelling out the acronym on first use.
- [Figures] Figure captions for attention heatmaps could more explicitly state the tokenization difference being visualized to aid readers unfamiliar with the vocabulary mismatch setting.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our paper. We address each of the major comments point by point below, and we will revise the manuscript to incorporate additional analyses and experiments where feasible.
Point-by-point responses
- Referee: [§3] §3 (Method, DSKD-CMA-GA description): No quantitative verification is provided that the generative adversarial term actually aligns the key and query distributions (e.g., no pre/post MMD, Wasserstein, or cosine-distance measurements between the two distributions). Without this, it is impossible to confirm that the reported ROUGE-L gains arise from successful matching rather than incidental effects of additional training.
  Authors: We agree that providing quantitative evidence of the alignment achieved by the generative adversarial objective would better support our claims. In the revised manuscript, we will add pre- and post-training measurements of distribution distances, including MMD and average cosine similarity between the key and query vectors from the teacher and student models. These will be presented in Section 3 to demonstrate the effectiveness of the GA term in reducing the mismatch.
  Revision: yes
- Referee: [§4] §4 (Experiments): The manuscript lacks an ablation that isolates the contribution of the GA objective from the base DSKD-CMA components. This is load-bearing for the claim that the adversarial term is responsible for narrowing the cross- vs. same-tokenizer gap.
  Authors: We appreciate this point and acknowledge the value of a dedicated ablation study. We will include an ablation analysis in the revised Section 4, where we compare the performance of DSKD-CMA with and without the GA objective across the evaluated datasets. This will explicitly isolate the contribution of the adversarial component to the observed ROUGE-L improvements and the narrowing of the performance gap.
  Revision: yes
- Referee: [§4] §4 (Experiments): No loss curves, multiple random seeds, or stability diagnostics are reported for the combined distillation + adversarial objective, despite well-known instability risks of GAN-style training. This leaves open the possibility that the +0.37 OOD gain is within noise or an artifact of extra optimization steps.
  Authors: We recognize the potential concerns regarding the stability of the adversarial training. To address this, we will report loss curves for the distillation and adversarial losses in the revised manuscript. Additionally, we will conduct experiments with multiple random seeds and provide mean and standard deviation for the ROUGE-L scores to demonstrate the robustness of the results. This will help rule out the possibility that the gains are due to noise or optimization artifacts.
  Revision: yes
Circularity Check
No circularity: empirical method proposal with no self-referential derivations
The paper presents DSKD-CMA-GA as an empirical extension of prior DSKD-CMA, adding a generative adversarial term to align mismatched key/query distributions from distinct models. No equations, derivations, or first-principles claims appear that reduce performance gains (e.g., the reported +0.37 ROUGE-L) to a fitted quantity defined by the method itself or to a self-citation chain. The central result is an experimental observation on text generation quality, not a mathematical prediction forced by construction. External benchmarks and ablations would be needed to validate the alignment claim, but this is a correctness issue rather than circularity. The derivation chain is self-contained as a practical proposal without load-bearing self-definition or renaming of known results.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "we propose to integrate key-query matching into DSKD-CMA... Generative Adversarial Alignment (GA)... $L_{KQ} = \min_{P_Q} \max_{D} \left( \mathbb{E}_{q \sim p_Q}[\log D(q)] + \mathbb{E}_{k \sim p_K}[\log(1 - D(k))] \right)$"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "DSKD-CMA-GA achieves... ROUGE-L gains of 0.15–1.04 points... narrows the gap between same- and cross-tokenizer performance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.