SRA: Span Representation Alignment for Large Language Model Distillation
Pith reviewed 2026-05-09 15:18 UTC · model grok-4.3
The pith
SRA shifts LLM distillation alignment from tokens to attention-weighted span centers of mass for better cross-tokenizer transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SRA reframes cross-tokenizer knowledge distillation by moving the alignment target from individual tokens to robust spans, each represented by its attention-weighted center of mass under a multi-particle dynamical systems model, and demonstrates that this produces representations that are more stable across tokenizers and yield stronger distillation performance than token-level baselines.
What carries the argument
The span center of mass, defined as the attention-weighted average of token representations within a span and treated as the state of a particle cluster in a multi-particle dynamical system.
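This construction can be made concrete in a few lines. The sketch below assumes, as the authors' rebuttal indicates, that the attention scores are normalized with a softmax restricted to the span's own tokens; function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def span_center_of_mass(token_reprs, attn_scores):
    """Attention-weighted average of the token representations in one span.

    token_reprs: (n_tokens, d) hidden states for the tokens of a single span.
    attn_scores: (n_tokens,) unnormalized attention-derived salience scores.
    Returns the (d,) span center of mass.
    """
    # Softmax over the span's own tokens only, so the weights form a
    # convex combination (they sum to 1) — the "center of mass" reading.
    w = np.exp(attn_scores - attn_scores.max())
    w = w / w.sum()
    return w @ token_reprs

# Toy example: 3 subword tokens with 4-dim representations.
reprs = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
scores = np.array([2.0, 0.0, 0.0])  # first token is most salient
com = span_center_of_mass(reprs, scores)
```

Because the weights sum to one, the center of mass stays inside the convex hull of the span's token representations, which is what makes it a stable aggregate when token boundaries shift.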
If this is right
- Distillation performance becomes less dependent on the exact token boundaries chosen by each model's tokenizer.
- Attention weighting focuses alignment on the most salient spans, preserving semantic content that would be diluted at the token level.
- The geometric regularizer maintains structural consistency in the shared representation space during transfer.
- Adding aligned span logit distillation supplies an extra channel for knowledge transfer beyond representation matching alone.
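Taken together, these bullets suggest a composite training objective. The sketch below is a hedged illustration only: the paper's exact loss forms and coefficients are not given here, so MSE matching of span centers, pairwise-distance preservation (standing in for the geometric regularizer), and a KL term on span logits are assumptions, as are the names `sra_style_loss`, `lam_geo`, and `lam_kd`.

```python
import numpy as np

def pairwise_dists(x):
    # Euclidean distance matrix between span representations.
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sra_style_loss(t_coms, s_coms, t_logits, s_logits,
                   lam_geo=0.1, lam_kd=1.0):
    """Illustrative composite objective in the spirit of the bullets above.

    t_coms, s_coms: (n_spans, d) teacher/student span centers of mass,
                    assumed already projected into a shared space.
    t_logits, s_logits: (n_spans, V) span-level logits over a shared label set.
    """
    # 1) Representation matching on span centers of mass.
    l_rep = ((s_coms - t_coms) ** 2).mean()
    # 2) "Geometric" regularizer: match pairwise distances between spans,
    #    preserving the structure of the shared representation space.
    l_geo = ((pairwise_dists(s_coms) - pairwise_dists(t_coms)) ** 2).mean()
    # 3) Aligned span logit distillation: KL(teacher || student) per span.
    p_t, p_s = softmax(t_logits), softmax(s_logits)
    l_kd = (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean()
    return l_rep + lam_geo * l_geo + lam_kd * l_kd

rng = np.random.default_rng(0)
t_coms = rng.normal(size=(4, 8))
logits = rng.normal(size=(4, 5))
identical = sra_style_loss(t_coms, t_coms, logits, logits)  # → 0.0
```

The loss is zero when student and teacher spans coincide exactly and grows with any mismatch, which is the minimal sanity property any such composite objective should satisfy.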
Where Pith is reading between the lines
- The same span-center approach could be tested on other cross-model tasks such as retrieval or translation where tokenizers also differ.
- If the particle-cluster framing is useful, it might suggest treating attention heads themselves as dynamical systems whose equilibria can be aligned directly.
- The method may scale to distillation involving multimodal models where spans could be defined over image patches or audio segments as well.
Load-bearing premise
Modeling spans as particle clusters and using their attention-weighted centers of mass produces representations that remain robust to tokenizer mismatch and carry more useful information for distillation than token-level aggregation.
What would settle it
Re-running the reported cross-architecture distillation experiments but replacing the attention-weighted span center of mass with either token-level alignment or non-attention-weighted span averages, and checking whether the performance gap over CTKD baselines disappears.
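The proposed control amounts to swapping the aggregation function while holding the rest of the pipeline fixed. A minimal sketch of the three conditions (illustrative code, not the paper's implementation):

```python
import numpy as np

def span_repr(token_reprs, mode, attn_scores=None):
    """Aggregation variants for the ablation: 'token' leaves representations
    unaggregated (token-level alignment would match these directly),
    'uniform' is a plain span average, and 'attention' is the
    attention-weighted center of mass."""
    if mode == "token":
        return token_reprs
    if mode == "uniform":
        return token_reprs.mean(axis=0)
    if mode == "attention":
        e = np.exp(attn_scores - attn_scores.max())
        return (e / e.sum()) @ token_reprs
    raise ValueError(f"unknown mode: {mode}")

x = np.array([[1.0, 0.0], [0.0, 1.0]])
uni = span_repr(x, "uniform")                            # [0.5, 0.5]
att = span_repr(x, "attention", np.array([10.0, 0.0]))   # ≈ [1.0, 0.0]
```

If the performance gap over CTKD baselines survives under `uniform` but not `token`, the credit goes to span-level aggregation itself; if it requires `attention`, the weighting scheme is doing real work.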
Original abstract
Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM), an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SRA, a framework for cross-tokenizer knowledge distillation (CTKD) that reframes alignment through multi-particle dynamical systems. It shifts from token-level to span-level representations, where each span is modeled as a cluster of particles whose state is captured by an attention-weighted center of mass (CoM). The method adds a geometric regularizer to maintain structural properties of the representation space and aligned span-logit distillation for improved transfer. The central empirical claim is that SRA consistently and significantly outperforms state-of-the-art CTKD baselines in cross-architecture distillation experiments.
Significance. If the reported gains prove robust, SRA could offer a practical advance for distilling knowledge between LLMs with mismatched tokenizers and architectures by using higher-level, semantically richer alignment units. The physical-systems framing provides intuitive motivation for the CoM construction and regularizer, and the combination of components addresses a known brittleness in token-level CTKD. Reproducibility would be strengthened by explicit reporting of the empirical validation against baselines.
Major comments (2)
- §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.
- §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.
Minor comments (2)
- Abstract: The phrase 'challenging cross-architecture distillation experiments' should name the specific teacher-student architecture pairs and datasets to allow immediate assessment of the claim's scope.
- Notation: Ensure consistent use of symbols for spans, CoM, and the geometric regularizer across sections; a table summarizing all hyperparameters would aid clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification and additional empirical support. We address each major comment point by point below, indicating the revisions we will incorporate in the updated version.
Point-by-point responses
-
Referee: §4 (Experiments): The abstract asserts consistent and significant outperformance over CTKD baselines, yet no quantitative metrics, error bars, dataset specifications, model pairs, or ablation results on span selection, CoM weighting, or the geometric regularizer are supplied. These details are load-bearing for evaluating whether the gains exceed what could be achieved by standard aggregation functions or post-hoc tuning.
Authors: We appreciate the referee's emphasis on empirical rigor. While Section 4 reports performance numbers on cross-architecture pairs (e.g., Llama-2 to Mistral and similar), we acknowledge that error bars, explicit dataset/model tables, and component ablations were not sufficiently detailed. In the revision we will add: (i) mean and standard deviation over three random seeds for all main results, (ii) a summary table listing exact datasets, model sizes, and tokenizer vocabularies, and (iii) ablation tables isolating span selection heuristics, attention-based CoM weighting, and the geometric regularizer. These additions will directly address whether the observed gains exceed those obtainable from simpler aggregation baselines or post-hoc tuning. revision: yes
-
Referee: §3.2 (Center of Mass formulation): The CoM is defined as an attention-weighted average of tokens within a span, but the precise normalization of attention weights, handling of cross-tokenizer span boundaries, and any free parameters in the weighting scheme are not specified. Without this, it is unclear whether the method is truly tokenizer-agnostic or reduces to a fitted aggregation that could be replicated without the multi-particle framing.
Authors: We agree that the current description in §3.2 lacks sufficient mathematical detail. The attention weights are normalized with a softmax taken exclusively over the tokens belonging to each span (ensuring they sum to one). Span boundaries are aligned across tokenizers by first recovering word-level segments from the original text via a deterministic detokenization step, then projecting those segments onto each model's subword sequence; this mapping uses no learned parameters. The weighting itself is taken directly from the teacher's attention heads with no additional hyperparameters. We will revise §3.2 to include the explicit normalized CoM equation, the word-level alignment procedure, and pseudocode, thereby clarifying that the construction is tokenizer-agnostic and motivated by the multi-particle analogy rather than being an arbitrary fitted aggregator. revision: yes
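The word-level projection the response describes can be sketched as a greedy match of subword surface forms back to detokenized words. This is illustrative only: real tokenizers' whitespace and byte markers need extra handling, and `project_word_spans` is a hypothetical helper, not the authors' code.

```python
def project_word_spans(words, subword_tokens):
    """Map word-level segments onto a tokenizer's subword sequence.

    Greedily rebuilds each word from consecutive subwords and records the
    [start, end) subword range it occupies. Assumes BERT-style '##'
    continuation markers; other schemes (byte-level, SentencePiece '_')
    would need their own stripping rules.
    """
    spans, i = [], 0
    for word in words:
        start, built = i, ""
        while built != word:
            built += subword_tokens[i].lstrip("#")  # drop continuation marker
            i += 1
        spans.append((start, i))
    return spans

# The same words under two different tokenizations map to comparable spans:
words = ["tokenizer", "mismatch"]
tok_a = ["token", "##izer", "mis", "##match"]
tok_b = ["tokeniz", "##er", "mismatch"]
spans_a = project_word_spans(words, tok_a)  # [(0, 2), (2, 4)]
spans_b = project_word_spans(words, tok_b)  # [(0, 2), (2, 3)]
```

Because both models' spans are anchored to the same word segments, the CoM of span k in the teacher can be paired with the CoM of span k in the student regardless of how many subwords each side used.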
Circularity Check
No significant circularity; empirical claims rest on independent experimental validation
Full rationale
The paper defines SRA via an explicit modeling choice (attention-weighted span CoM under a multi-particle analogy) and reports empirical gains on cross-architecture distillation benchmarks. No equations, uniqueness theorems, or self-citations are shown that reduce the reported performance to a fitted parameter or to the input data by construction. The physical framing functions as interpretive motivation for the aggregation unit; success is measured by downstream distillation metrics rather than by any internal identity or self-referential prediction. The derivation chain is therefore self-contained against external benchmarks.