pith. machine review for the scientific record.

arxiv: 2605.01058 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords early exit · knowledge distillation · transformer inference · computational efficiency · layer-wise training · pretraining objective · inference acceleration · model compression

The pith

Distillation suppresses layer convergence needed for early exits, but an auxiliary constraint during training restores effective acceleration without architecture changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that standard layer-aligned distillation prevents representations from converging across layers in the way early-exit mechanisms require to safely stop computation early. This makes early exits produce no real speedup on distilled models under normal conditions. LEAP augments the training process with one additional objective that forces each intermediate layer to approximate the final layer's representation. The result is that early exits become usable again, delivering measured wall-clock speedups on hardware while task performance on similarity and retrieval benchmarks stays intact and no model changes are required.
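The abstract does not spell out the exact form of the auxiliary objective, only that it pushes each intermediate layer toward the final layer's representation. As a hedged illustration of what such a term could look like in PyTorch, assuming a cosine-distance penalty on masked mean-pooled states, a detached final-layer target, and an illustrative weight lambda_leap (none of these choices are confirmed by the paper):

```python
import torch
import torch.nn.functional as F

def leap_style_auxiliary_loss(hidden_states, attention_mask, lambda_leap=0.1):
    """Illustrative convergence penalty: pull each intermediate layer's pooled
    representation toward the final layer's, so that early layers already
    approximate the model's final output.

    hidden_states: sequence of [batch, seq, dim] tensors, one per layer
        (e.g. outputs.hidden_states from an encoder run with
        output_hidden_states=True; index 0 is the embedding layer).
    attention_mask: [batch, seq] tensor used for masked mean pooling.
    """
    mask = attention_mask.unsqueeze(-1).float()

    def pool(h):
        # masked mean pooling over the sequence dimension
        return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    final = pool(hidden_states[-1]).detach()  # assumption: final layer is a fixed target
    intermediate = hidden_states[1:-1]        # skip embeddings and the final layer
    loss = sum((1.0 - F.cosine_similarity(pool(h), final, dim=-1)).mean()
               for h in intermediate) / max(len(intermediate), 1)
    return lambda_leap * loss

# total_loss = standard_distillation_loss + leap_style_auxiliary_loss(hs, mask)
```

Which layers are penalized, whether the target is detached, the distance metric, and the weighting against the distillation loss are exactly the details the paper would need to specify; the sketch only fixes the general shape of the constraint.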

Core claim

Layer-aligned distillation and convergence-based early exit are incompatible because distillation objectives suppress the representational convergence across layers that early-exit mechanisms exploit. This incompatibility is reconciled by augmenting standard distillation with an auxiliary objective that ensures intermediate layers approximate final-layer representations, resulting in effective early exits and speedups such as 1.61 times wall-clock improvement where standard distilled models achieve none.

What carries the argument

The LEAP auxiliary training objective, which adds a single constraint ensuring that intermediate layers approximate final-layer representations during pretraining.

If this is right

  • Early-exit mechanisms become viable on distilled models, supporting up to 1.80 times theoretical layer reduction.
  • 91.9 percent of samples exit by layer 7 at a 0.95 confidence threshold, producing 1.61 times measured wall-clock speedup on NVIDIA L4 hardware at batch size 1 (see the sketch after this list).
  • Task performance holds steady on STS-B at 0.760 plus or minus 0.006 and across BEIR retrieval benchmarks.
  • No architectural modifications are needed, so the approach applies directly to existing distilled models.
  • Standard distillation without the constraint produces zero effective speedup from early exits.
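The exit mechanism referenced in the bullets can be pictured as a per-input loop that stops as soon as the representation has converged. The paper's precise decision rule is not given in this summary, so the sketch below assumes a simple stop-when-consecutive-layers-agree criterion with masked mean pooling and a simplified per-layer interface:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_with_early_exit(layers, embedding_output, attention_mask, theta=0.95):
    """Illustrative convergence-based early exit for a single input (batch=1).

    layers: list of callables mapping hidden states [1, seq, dim] to the next
        layer's hidden states (a real encoder layer would also take the
        attention mask; this is simplified for the sketch).
    Returns (pooled sentence representation, index of the exit layer).
    """
    mask = attention_mask.unsqueeze(-1).float()

    def pool(h):
        return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    hidden = embedding_output
    prev = pool(hidden)
    cur, exit_layer = prev, len(layers)
    for i, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        cur = pool(hidden)
        # exit once successive layers' pooled representations agree above theta
        if F.cosine_similarity(cur, prev, dim=-1).item() >= theta:
            exit_layer = i
            break
        prev = cur
    return cur, exit_layer
```

Because different inputs exit at different depths, batching complicates this loop; that is one reason the wall-clock figure above is reported at batch size 1.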

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training with this constraint could enable inference systems that adapt the number of layers used per input based on its difficulty.
  • The same layer-matching idea might resolve conflicts between distillation and other optimizations that depend on consistent layer behavior.
  • Such constraints could be adopted as a default step when creating efficient transformer checkpoints for varied deployment settings.

Load-bearing premise

The added constraint during training reconciles the incompatibility between distillation and early exits without substantially harming the quality of the distilled representations or the performance on downstream tasks.

What would settle it

Applying the LEAP constraint during training yet still observing zero effective speedup from early exits on hardware, or seeing a clear drop in accuracy on STS-B or BEIR benchmarks relative to standard distillation.

Figures

Figures reproduced from arXiv: 2605.01058 by Anupriya Sharma, Deep Naryan Mishra, Haoan Wang, Rishi Bhatia, Saipraveen Vabbilisetty, Shashank Kapadia, Sujal Reddy Alugubelli.

Figure 1: Layer dynamics comparison. (a) Cosine simi…
Figure 2: Per-layer exit distribution at θ=0.95. Peak at layer 7; 91.9% cumulative exit by L7.
Figure 3: Pareto curve: layer reduction vs. expected…
Figure 4: Contraction analysis: (a) Per-layer contraction…
Figure 5: NN@10 failure analysis per exit layer (layers…)
read the original abstract

Layer-aligned distillation and convergence-based early exit represent two predominant computational efficiency paradigms for transformer inference; yet we establish that they exhibit systematic incompatibility under standard deployment conditions for convergence-based early exit. Distillation objectives that align intermediate student layers to teacher representations suppress the representational convergence that early-exit mechanisms exploit, rendering such mechanisms ineffective on distilled models. We introduce LEAP (Layer-wise Exit-Aware Pretraining), an auxiliary training objective that reconciles this incompatibility. LEAP requires no architectural modifications; it augments standard distillation with a single constraint ensuring intermediate layers approximate final-layer representations. LEAP-MiniLM achieves 1.61$\times$ measured wall-clock speedup (batch=1, NVIDIA L4) at $\theta$=0.95, with 91.9% of samples exiting by layer 7 and 1.80$\times$ theoretical layer reduction, where standard distilled models achieve zero effective speedup. We validate across sentence similarity (STS-B: 0.760 $\pm$ 0.006) and retrieval benchmarks (BEIR), providing operational guidance including latency measurements, decision thresholds, and deployment criteria.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that layer-aligned distillation and convergence-based early exit are systematically incompatible because standard distillation objectives suppress the representational convergence (intermediate layers approximating the final layer) that early-exit mechanisms rely on, resulting in zero effective speedup on distilled models. It introduces LEAP as an auxiliary pretraining objective that adds a single constraint enforcing this convergence without architectural changes. On LEAP-MiniLM it reports 1.61× measured wall-clock speedup (batch=1, NVIDIA L4) at θ=0.95 with 91.9% of samples exiting by layer 7 and 1.80× theoretical layer reduction, while standard distilled models achieve zero speedup; results are validated on STS-B (0.760 ± 0.006) and BEIR benchmarks together with operational guidance on thresholds and deployment.

Significance. If the incompatibility is causal and LEAP restores convergence without degrading distillation quality, the result is significant: it would allow two dominant efficiency paradigms to be combined on the many already-distilled transformer models in deployment, yielding practical inference speedups with no architecture changes. The concrete latency numbers, exit statistics, and error bars on downstream tasks provide actionable deployment information.

major comments (3)
  1. [Abstract] The central claim that distillation suppresses representational convergence (and thereby nullifies early exit) is evidenced only by the reported zero speedup on standard distilled models; without direct measurements of layer-to-final representational similarity (e.g., cosine similarity) or an ablation that isolates this mechanism from other factors, the causal attribution cannot be confirmed.
  2. [Abstract] The LEAP auxiliary constraint is asserted to reconcile the incompatibility 'without meaningfully degrading distillation quality or downstream task performance,' yet no quantitative comparison of the original distillation loss with versus without the LEAP term, nor an ablation removing the constraint, is provided to support this.
  3. [Experimental results] The contrast between 1.61× speedup / 91.9% early exits for LEAP-MiniLM and zero effective speedup for standard models would be strengthened by controls that vary the exit threshold θ and the precise exit decision rule, to rule out sensitivity to these hyperparameters as an alternative explanation for the observed difference.
minor comments (2)
  1. The precise mathematical form of the LEAP constraint (distance metric, weighting, and how it is combined with the distillation loss) is not stated explicitly enough for reproduction.
  2. The definition of the exit decision criterion (confidence, entropy, or other) used with threshold θ should be stated clearly in the methods or experimental section.
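Major comment 1 asks for direct layer-to-final similarity measurements rather than speedup numbers alone. A minimal sketch of that diagnostic, assuming a Hugging Face encoder checkpoint (the model name in the example is an arbitrary public MiniLM checkpoint, not necessarily the one used in the paper):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def layer_to_final_similarity(model_name, sentences):
    """Mean cosine similarity between each layer's pooled representation and
    the final layer's. Under the paper's claim, a standard distilled model
    should stay low at middle layers, while a LEAP-trained model should
    approach 1.0 several layers early."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    mask = batch["attention_mask"].unsqueeze(-1).float()

    def pool(h):
        return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    final = pool(out.hidden_states[-1])
    return [F.cosine_similarity(pool(h), final, dim=-1).mean().item()
            for h in out.hidden_states[1:]]  # one value per transformer layer

# e.g. layer_to_final_similarity("sentence-transformers/all-MiniLM-L6-v2",
#                                ["an example sentence", "another example"])
```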

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that additional direct evidence and controls will strengthen the manuscript. We address each major comment below and will incorporate the suggested analyses in the revised version.

read point-by-point responses
  1. Referee: [Abstract] The central claim that distillation suppresses representational convergence (and thereby nullifies early exit) is evidenced only by the reported zero speedup on standard distilled models; without direct measurements of layer-to-final representational similarity (e.g., cosine similarity) or an ablation that isolates this mechanism from other factors, the causal attribution cannot be confirmed.

    Authors: We agree that direct measurements of layer-to-final representational similarity would provide stronger causal evidence beyond the observed zero speedup. In the revised manuscript, we will add cosine similarity plots between intermediate-layer and final-layer representations for both standard distilled models and LEAP-MiniLM. We will also include an ablation isolating the convergence constraint to confirm its role in restoring the necessary representational properties for early exit. revision: yes

  2. Referee: [Abstract] The LEAP auxiliary constraint is asserted to reconcile the incompatibility 'without meaningfully degrading distillation quality or downstream task performance,' yet no quantitative comparison of the original distillation loss with versus without the LEAP term, nor an ablation removing the constraint, is provided to support this.

    Authors: We acknowledge the value of quantitative comparisons to support the claim of no meaningful degradation. In the revision, we will add (i) training curves and final values comparing the primary distillation loss with and without the LEAP term, and (ii) an ablation removing the convergence constraint, reporting effects on both convergence metrics and downstream performance (STS-B and BEIR). revision: yes

  3. Referee: [Experimental results] The contrast between 1.61× speedup / 91.9% early exits for LEAP-MiniLM and zero effective speedup for standard models would be strengthened by controls that vary the exit threshold θ and the precise exit decision rule, to rule out sensitivity to these hyperparameters as an alternative explanation for the observed difference.

    Authors: We agree that varying the exit threshold and decision rules will help rule out hyperparameter sensitivity. In the revised manuscript, we will report speedup, exit statistics, and task performance for a range of θ values (0.80–0.99) and alternative exit decision rules (e.g., different confidence thresholds) applied to both LEAP-MiniLM and standard distilled models. revision: yes
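To make the promised control concrete, a threshold sweep over θ in the 0.80–0.99 range could be organized as below. Here exit_layer_fn is a hypothetical helper that returns the exit layer for one sample at a given threshold, and n_layers=12 is an assumed backbone depth rather than a figure taken from the paper:

```python
import numpy as np

def sweep_exit_thresholds(exit_layer_fn, samples, thetas=None, n_layers=12):
    """Record exit statistics across confidence thresholds.

    exit_layer_fn(sample, theta): hypothetical helper returning the layer
        index at which `sample` exits under threshold `theta`.
    n_layers: assumed encoder depth, used only for the reduction ratio.
    """
    if thetas is None:
        thetas = np.round(np.arange(0.80, 1.00, 0.01), 2)
    rows = []
    for theta in thetas:
        exits = np.array([exit_layer_fn(s, float(theta)) for s in samples])
        rows.append({
            "theta": float(theta),
            "mean_exit_layer": float(exits.mean()),
            "frac_exit_by_layer_7": float((exits <= 7).mean()),
            "theoretical_layer_reduction": n_layers / float(exits.mean()),
        })
    return rows
```

The same sweep run on both LEAP-MiniLM and a standard distilled baseline would directly address the sensitivity concern in the referee's third comment.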

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation

full rationale

The paper presents LEAP as an explicit auxiliary training constraint added to standard distillation, with the central claims resting on measured wall-clock speedups, exit statistics, and downstream task scores obtained after training. These outcomes are not algebraically forced by the definition of the LEAP objective itself; they are reported as results of concrete training runs and inference benchmarks. No equations reduce the claimed incompatibility or speedup to quantities defined solely by the method, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or self-citation chain is invoked to justify the core result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation of incompatibility between two existing paradigms and the effectiveness of one added constraint. No new free parameters, invented entities, or non-standard axioms are described in the abstract.

axioms (1)
  • domain assumption: Standard assumptions of transformer distillation and early-exit mechanisms remain valid.
    The work builds directly on layer-aligned distillation and convergence-based early exit without questioning their foundational premises.

pith-pipeline@v0.9.0 · 5533 in / 1392 out tokens · 35391 ms · 2026-05-09T19:26:45.627337+00:00 · methodology

