The authors introduce DSKD-CMA-GA using generative adversarial learning to fix key-query distribution mismatches in cross-tokenizer knowledge distillation, reporting modest average ROUGE-L gains of 0.37 especially on out-of-distribution data.
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch
The authors introduce DSKD-CMA-GA using generative adversarial learning to fix key-query distribution mismatches in cross-tokenizer knowledge distillation, reporting modest average ROUGE-L gains of 0.37 especially on out-of-distribution data.