pith. machine review for the scientific record.

arxiv: 2605.01374 · v1 · submitted 2026-05-02 · 💻 cs.CL

Recognition: unknown

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation


Pith reviewed 2026-05-09 14:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords: knowledge distillation · large language models · representation alignment · multi-granular trajectory alignment · dynamic structural alignment

The pith

Multi-granular trajectory alignment improves knowledge distillation by matching teacher and student representations at word level in lower layers and phrase level in higher layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that existing knowledge distillation methods for large language models limit knowledge transfer because they align representations only at fixed layers or token-level outputs, ignoring how representations evolve across model depth. MTA addresses this by aligning along the layer-wise transformation trajectory with a layer-adaptive strategy that uses word-level spans in lower layers to preserve lexical information and phrase-level spans in higher layers to capture compositional semantics. A sympathetic reader would care because stronger internal relational structure in the student could yield smaller models that retain more of the teacher's capability without additional training cost. The approach is supported by a Dynamic Structural Alignment loss that matches relative geometry among semantic units and a Hidden Representation Alignment loss for direct layer matching, with experiments showing consistent gains over baselines.

Core claim

MTA shows that aligning teacher and student representations along their layer-wise transformation trajectory via a layer-adaptive multi-granular strategy, instantiated through a Dynamic Structural Alignment loss that matches relative geometry among semantic units and supplemented by Hidden Representation Alignment, enables the student to capture the teacher's evolving internal structure more effectively than fixed-layer or token-level methods.

What carries the argument

The Dynamic Structural Alignment loss, which matches the relative geometry among semantic units within each layer under a layer-adaptive word-to-phrase granularity switch.
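
To make that machinery concrete, here is a minimal PyTorch sketch of relative-geometry matching under a word-to-phrase granularity switch. The mean-pooled spans, the normalized pairwise Euclidean distances, the MSE objective, and the hard `phrase_start_layer` threshold are illustrative assumptions; the paper's exact formulation is not quoted in this review.

```python
# Minimal sketch of a Dynamic Structural Alignment (DSA) loss.
# Illustrative assumptions, not the paper's exact formulation:
# mean-pooled span embeddings, normalized pairwise Euclidean distances
# as the "relative geometry", MSE between teacher and student distance
# matrices, and a hard layer threshold for the word -> phrase switch.
# Teacher and student layer lists are assumed pre-paired (e.g., by a
# uniform layer mapping).
import torch
import torch.nn.functional as F

def pool_spans(hidden, spans):
    """Mean-pool hidden states [seq, dim] over (start, end) token spans."""
    return torch.stack([hidden[s:e].mean(dim=0) for s, e in spans])

def relative_geometry(units):
    """Normalized pairwise distance matrix among semantic units [n, dim]."""
    d = torch.cdist(units, units, p=2)
    return d / (d.max() + 1e-8)

def dsa_loss(teacher_layers, student_layers, word_spans, phrase_spans,
             phrase_start_layer):
    """Word-level spans in lower layers, phrase-level spans in higher ones."""
    loss = 0.0
    for l, (h_t, h_s) in enumerate(zip(teacher_layers, student_layers)):
        spans = word_spans if l < phrase_start_layer else phrase_spans
        g_t = relative_geometry(pool_spans(h_t, spans))
        g_s = relative_geometry(pool_spans(h_s, spans))
        loss = loss + F.mse_loss(g_s, g_t)
    return loss / len(student_layers)
```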

If this is right

  • The student better captures compositional semantics in higher layers while preserving lexical details in lower layers.
  • Ablation studies confirm that both the dynamic structural component and the hidden representation component contribute to the observed gains.
  • The design aligns with linguistic principles in which higher-level meaning arises from composition of lower-level units.
  • Knowledge transfer becomes stronger because the student is guided by the teacher's full trajectory rather than isolated snapshots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layer-adaptive granularity principle could be tested in vision transformers or other hierarchical architectures where abstraction also increases with depth.
  • If the relative-geometry matching proves robust, it may reduce the need for very deep student models in resource-constrained settings.
  • The approach leaves open whether the same trajectory alignment can be applied during pre-training rather than only at distillation time.

Load-bearing premise

That the increasing abstraction of Transformer representations with depth makes word-level alignment optimal for lower layers and phrase-level alignment optimal for higher layers.

What would settle it

A controlled experiment in which a fixed-granularity alignment applied uniformly across all layers matches or exceeds MTA performance on the same benchmarks and model pairs would falsify the necessity of the multi-granular trajectory approach.
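
One way to read that test as code, reusing the `dsa_loss` sketch above: hold every factor fixed except the granularity schedule, so any performance gap isolates the multi-granular switch. The condition names and the midpoint switch layer are assumptions, not the paper's protocol.

```python
# Hypothetical falsification run: everything fixed except the
# granularity schedule. If "word_only" or "phrase_only" matches
# "adaptive" on the same benchmarks and model pairs, the multi-granular
# switch is not necessary.
def granularity_schedule(condition, num_layers):
    if condition == "word_only":     # word-level spans at every layer
        return num_layers            # the phrase switch never fires
    if condition == "phrase_only":   # phrase-level spans at every layer
        return 0                     # the phrase switch fires at layer 0
    if condition == "adaptive":      # MTA-style layer-adaptive switch
        return num_layers // 2       # assumed midpoint, for illustration
    raise ValueError(condition)

for condition in ("word_only", "phrase_only", "adaptive"):
    phrase_start = granularity_schedule(condition, num_layers=24)
    # Train the student with dsa_loss(..., phrase_start_layer=phrase_start)
    # and evaluate all three conditions on identical benchmarks.
```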

Figures

Figures reproduced from arXiv: 2605.01374 by Linh Ngo Van, Pham Khanh Chi, Quoc Phong Dao, Thanh Hong Nguyen, Thuat Nguyen, Trung Le.

Figure 1. The correspondence between linguistic compositionality and the layer-wise evolution of representations in large language models.
Figure 2. Dynamic Structural Alignment (L_DSA). This objective enforces geometric consistency: it computes the pairwise relational distances between semantic spans (words or phrases) within a layer for both teacher and student, and minimizes the discrepancy between these two structures across network depth so that the student learns to replicate the teacher's representational trajectory.
Figure 3. Hidden Representation Alignment strategy. A learned projection W_l ∈ ℝ^(d_S×d_T) maps the student's hidden states into the teacher's space (H̃^S_{t,l} = H^S_{t,l} W_l), and a weighted cosine distance between the projected student states and the teacher states is minimized to ensure accurate feature reconstruction at key layers.
Figure 4. GPT-4o-mini evaluation scores (1-100) for …
Figure 5. GPT-4o-mini evaluation scores (1-100) for …
Figure 6. Effect of the number of distilled intermediate …
Figure 7. Prompt for GPT-4 evaluation using Ground …
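
Figure 3's caption carries enough detail to sketch the Hidden Representation Alignment objective: a learned projection W_l maps student states into the teacher's space, and a weighted cosine distance is minimized at selected layers. The layer pairing and per-layer weights below are assumptions; the paper's weighting scheme is not visible from this page.

```python
# Minimal sketch of the Hidden Representation Alignment (HRA) loss as
# described in Figure 3. Only the learned projection W_l in R^{d_S x d_T}
# and the weighted cosine objective come from the caption; the layer
# pairing and per-layer weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenAlignment(nn.Module):
    def __init__(self, d_student, d_teacher, layer_pairs, weights=None):
        super().__init__()
        # One projection per aligned (student_layer, teacher_layer) pair.
        self.projections = nn.ModuleList(
            [nn.Linear(d_student, d_teacher, bias=False) for _ in layer_pairs]
        )
        self.layer_pairs = layer_pairs
        self.weights = weights or [1.0] * len(layer_pairs)

    def forward(self, student_layers, teacher_layers):
        loss = 0.0
        for w, proj, (ls, lt) in zip(self.weights, self.projections,
                                     self.layer_pairs):
            h_s = proj(student_layers[ls])   # H~^S_{t,l} = H^S_{t,l} W_l
            h_t = teacher_layers[lt]
            cos = F.cosine_similarity(h_s, h_t, dim=-1)  # per-token cosine
            loss = loss + w * (1.0 - cos).mean()
        return loss
```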
Original abstract

Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher's internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative geometry among semantic units within each layer. This design is motivated by empirical findings that Transformer representations become increasingly abstract with depth, and is also consistent with linguistic views in which higher-level meaning emerges through the composition of lower-level lexical units. We further incorporate a Hidden Representation Alignment loss to directly align selected teacher-student layers. Experiments show that MTA consistently outperforms state-of-the-art baselines on standard benchmarks, with ablations confirming the contribution of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multi-Granular Trajectory Alignment (MTA) for knowledge distillation of large language models. It addresses limitations of fixed-layer or token-level alignment by matching teacher and student representations along their layer-wise transformation trajectories. The method uses a layer-adaptive multi-granular strategy—word-level alignment in lower layers to preserve lexical information and phrase-level spans (e.g., noun/verb phrases) in higher layers to capture compositional semantics—implemented via a Dynamic Structural Alignment loss that matches relative geometry among semantic units plus a Hidden Representation Alignment loss for direct layer matching. The approach is motivated by observations that Transformer representations grow more abstract with depth and by linguistic principles of compositional meaning. Experiments are reported to show consistent outperformance over state-of-the-art baselines on standard benchmarks, with ablations confirming each component's contribution.

Significance. If the empirical claims hold, MTA offers a principled extension of representation alignment in distillation that respects the depth-dependent evolution of abstraction in Transformers, potentially improving knowledge transfer for compressed models. The design integrates standard empirical findings on representation abstraction with linguistic compositionality in a coherent way. Ablations that isolate the contribution of the multi-granular trajectory and the two losses are a positive feature, as they allow direct evaluation of the central design choices. The absence of parameter-free derivations or machine-checked proofs is typical for this empirical subfield but does not detract from the potential utility if the gains are reproducible.

major comments (2)
  1. [§4] §4 (Experiments): The central claim that MTA 'consistently outperforms state-of-the-art baselines on standard benchmarks' is load-bearing, yet the manuscript supplies no quantitative metrics (e.g., exact accuracy or perplexity deltas), baseline implementations, dataset splits, or error analysis/statistical significance tests. This prevents assessment of whether the data actually support the outperformance assertion and undermines reproducibility.
  2. [§3.2] §3.2 (Dynamic Structural Alignment loss): The loss is defined to match 'relative geometry among semantic units,' but the precise formulation (e.g., how phrase-level spans are extracted, whether an external parser is used, and the exact distance or similarity metric) is not fully specified with equations or pseudocode. This detail is load-bearing for the multi-granular claim and for reproducing the reported gains.
minor comments (2)
  1. [Abstract, §3] The abstract and §3 introduce 'Dynamic Structural Alignment loss' and 'Hidden Representation Alignment loss' without immediate equation references; adding forward pointers to the defining equations would improve readability.
  2. [Figure 1] Figure 1 (trajectory diagram) would benefit from explicit annotation of the word-level vs. phrase-level alignment boundaries and the layer indices at which the switch occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to enhance reproducibility and clarity while preserving the core contributions of MTA.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim that MTA 'consistently outperforms state-of-the-art baselines on standard benchmarks' is load-bearing, yet the manuscript supplies no quantitative metrics (e.g., exact accuracy or perplexity deltas), baseline implementations, dataset splits, or error analysis/statistical significance tests. This prevents assessment of whether the data actually support the outperformance assertion and undermines reproducibility.

    Authors: We agree that additional quantitative details and reproducibility information are essential. In the revised manuscript, we have expanded Section 4 with a comprehensive table reporting exact accuracy and perplexity values, performance deltas relative to each baseline, descriptions of baseline re-implementations (including hyperparameters), the precise dataset splits (e.g., standard GLUE/SuperGLUE partitions), and statistical significance results via paired bootstrap tests with 95% confidence intervals (a sketch of this test appears after these responses). These additions directly support the outperformance claims and address the reproducibility concerns. revision: yes

  2. Referee: [§3.2] §3.2 (Dynamic Structural Alignment loss): The loss is defined to match 'relative geometry among semantic units,' but the precise formulation (e.g., how phrase-level spans are extracted, whether an external parser is used, and the exact distance or similarity metric) is not fully specified with equations or pseudocode. This detail is load-bearing for the multi-granular claim and for reproducing the reported gains.

    Authors: We acknowledge the need for greater precision in the loss formulation. The revised Section 3.2 now includes the complete mathematical definition of the Dynamic Structural Alignment loss, specifying that phrase-level spans are identified via the spaCy dependency parser (with explicit rules for noun/verb phrases), that relative geometry is matched using normalized Euclidean distance on hidden states, and that the full alignment procedure is provided as pseudocode (one plausible rule set is sketched below). These details make the multi-granular trajectory alignment fully reproducible. revision: yes
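
The paired bootstrap promised in response 1 is a standard recipe; a minimal sketch, assuming per-example scores for MTA and one baseline on the same test set (all names illustrative):

```python
# Paired bootstrap significance test of the kind promised in the
# rebuttal: resample test examples with replacement and check how often
# the MTA-minus-baseline delta falls at or below zero.
import numpy as np

def paired_bootstrap(scores_mta, scores_base, n_resamples=10_000, seed=0):
    """Return (mean delta, 95% CI, one-sided p-value) for MTA - baseline."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_mta, dtype=float)
    b = np.asarray(scores_base, dtype=float)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    p = float((deltas <= 0.0).mean())
    return float(deltas.mean()), (float(lo), float(hi)), p
```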
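Response 2 pins span extraction to the spaCy dependency parser with noun/verb-phrase rules; one plausible rule set is sketched below, with the caveat that the revised paper's exact rules may differ and that the mapping from parser tokens to the model's subword tokens is omitted.

```python
# One plausible reading of the rebuttal's span-extraction rules: spaCy's
# dependency parse, with noun chunks as noun phrases and each verb
# grouped with the contiguous span covering its direct dependents as a
# verb phrase.
import spacy

nlp = spacy.load("en_core_web_sm")

def phrase_spans(text):
    """Return sorted (start, end) token spans for noun and verb phrases."""
    doc = nlp(text)
    spans = {(chunk.start, chunk.end) for chunk in doc.noun_chunks}
    for token in doc:
        if token.pos_ == "VERB":
            left = min([token.i] + [child.i for child in token.lefts])
            right = max([token.i] + [child.i for child in token.rights])
            spans.add((left, right + 1))
    return sorted(spans)

# Word-level spans are just single tokens:
# word_spans = [(i, i + 1) for i in range(len(doc))]
```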

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes MTA as a new distillation framework that introduces layer-adaptive multi-granular alignment losses (Dynamic Structural Alignment on relative geometry plus Hidden Representation Alignment) motivated by standard empirical observations on Transformer abstraction with depth and linguistic compositionality principles. These losses are defined directly from the proposed architecture rather than being fitted to data and then renamed as predictions, and the central claims do not reduce to self-citations, self-definitions, or imported uniqueness theorems from the authors' prior work. The derivation chain is self-contained: the method is presented as an extension of existing representation alignment techniques with explicit design choices justified by cited external findings, without any step where a claimed result is equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the effectiveness of newly introduced alignment losses and the validity of layer-adaptive granularity assumptions drawn from external observations rather than first-principles derivation.

axioms (2)
  • domain assumption: Transformer representations become increasingly abstract with depth
    Invoked to justify word-level alignment in lower layers and phrase-level in higher layers
  • domain assumption: Higher-level meaning emerges through the composition of lower-level lexical units
    Cited as consistent with the linguistic view motivating the multi-granular design
invented entities (2)
  • Dynamic Structural Alignment loss (no independent evidence)
    purpose: Matches the relative geometry among semantic units within each layer
    Core new component of the MTA framework
  • Hidden Representation Alignment loss (no independent evidence)
    purpose: Directly aligns selected teacher-student layers
    Supplementary component to the trajectory alignment

pith-pipeline@v0.9.0 · 5523 in / 1388 out tokens · 90380 ms · 2026-05-09T14:50:59.346561+00:00 · methodology

