BEExformer: A Fast Inferencing Binarized Transformer with Early Exits

Amlan Chakrabarti; Saptarsi Goswami; Wazib Ansar

arxiv: 2412.05225 · v3 · submitted 2024-12-06 · 💻 cs.CL · cs.AI · cs.NE

Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti

Pith reviewed 2026-05-23 07:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.NE

keywords binarized transformers · early exit · model compression · efficient inference · natural language processing · overthinking problem · transformer optimization · selective learning

The pith

Binarized transformer with early exits reduces model size 21 times while cutting computation and raising accuracy on text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEExformer to combine binarization-aware training with early-exit decisions inside transformer blocks for text inference. It claims this pairing shrinks the model dramatically, speeds up computation through selective early stopping, and actually lifts accuracy by fixing the tendency of deep networks to overthink. The design adds a selective learn-forget network per block and uses fractional entropy drop as the exit signal. A reader would care because current transformers are too large and slow for many devices, and the method suggests these two efficiency tricks can reinforce rather than compete with each other. If the claim holds, binarized early-exit models become a practical route to deploy capable language models under tight resource limits.

Core claim

BEExformer integrates Binarization-Aware Training, which uses a differentiable second-order approximation to the sign function for gradient updates, together with an Early Exit mechanism driven by fractional entropy reduction and soft-routing loss; the result is a 21.30 times smaller model, 52.27 percent fewer FLOPs at inference, and 3.22 percent higher accuracy across nine NLP datasets by resolving overthinking.
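A small sketch may help make the "differentiable second-order approximation to the sign function" concrete. The paper's own binarization function b(r) is not reproduced in this summary, so the code below swaps in the piecewise quadratic surrogate popularized by Bi-Real Net; treating it as a stand-in for b(r) is an assumption, and the names `approx_sign`, `approx_sign_grad`, and `binarize` are hypothetical.

```python
def approx_sign(r: float) -> float:
    """Piecewise quadratic surrogate for sign(r) (Bi-Real Net style, assumed stand-in)."""
    if r < -1.0:
        return -1.0
    if r < 0.0:
        return 2.0 * r + r * r
    if r < 1.0:
        return 2.0 * r - r * r
    return 1.0


def approx_sign_grad(r: float) -> float:
    """Derivative of approx_sign; it depends on the magnitude of r inside [-1, 1)."""
    if -1.0 <= r < 0.0:
        return 2.0 + 2.0 * r
    if 0.0 <= r < 1.0:
        return 2.0 - 2.0 * r
    return 0.0


def binarize(r: float) -> float:
    """Hard sign used in the forward pass; zero maps to +1 by convention here."""
    return 1.0 if r >= 0.0 else -1.0


if __name__ == "__main__":
    for r in (-1.5, -0.5, 0.0, 0.5, 1.5):
        print(r, binarize(r), approx_sign(r), approx_sign_grad(r))
```

In such a scheme the forward pass would use `binarize`, while the backward pass would use `approx_sign_grad` in place of the zero-almost-everywhere derivative of the true sign. Unlike a clipped straight-through estimator, whose gradient is constant inside [-1, 1], this surrogate's gradient shrinks as |r| grows, which is one way a gradient can reflect weight magnitude as well as sign.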

What carries the argument

Selective-Learn Forget Network (SLFN) placed inside each transformer block to retain context while discarding irrelevant information, paired with entropy-based early exit routing.

If this is right

Model size drops by a factor of 21.30 through binarization while preserving or improving task performance.
Inference requires 52.27 percent fewer FLOPs because many inputs exit after early blocks.
Accuracy rises 3.22 percent on average by avoiding the overthinking that occurs in full-depth networks.
The same architecture delivers Pareto-optimal accuracy-efficiency balance on nine datasets spanning multiple NLP tasks.
No task-specific post-hoc tuning is required for the entropy exit rule to work.
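The size and FLOP figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not the paper's accounting: it assumes a 32-bit baseline, counts only parameter storage, treats every block as equally expensive, and ignores the cost of exit heads. The function names are hypothetical.

```python
def compression_ratio(binarized_fraction: float,
                      full_bits: int = 32,
                      low_bits: int = 1) -> float:
    """Size ratio when a fraction of parameters is stored in low_bits instead of full_bits."""
    avg_bits = (1.0 - binarized_fraction) * full_bits + binarized_fraction * low_bits
    return full_bits / avg_bits


def binarized_fraction_for(ratio: float,
                           full_bits: int = 32,
                           low_bits: int = 1) -> float:
    """Invert compression_ratio: the binarized fraction that would yield a given ratio."""
    return (full_bits - full_bits / ratio) / (full_bits - low_bits)


def expected_flop_fraction(exit_probs) -> float:
    """Expected share of full-depth compute when exit_probs[k] of inputs stop after block k+1."""
    n = len(exit_probs)
    return sum(p * (k + 1) / n for k, p in enumerate(exit_probs))


if __name__ == "__main__":
    print(binarized_fraction_for(21.30))
    print(expected_flop_fraction([0.25, 0.25, 0.25, 0.25]))
```

Under these assumptions, a 21.30 times reduction would correspond to roughly 98.4 percent of parameters being stored in one bit, with the remainder kept at full precision. The exit distribution passed to `expected_flop_fraction` is purely illustrative and is not taken from the paper.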

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The entropy exit rule could be tested directly on vision or speech transformers to check whether the same signal works outside text.
Hardware implementations could exploit the fixed binarized weights plus variable depth to reduce memory bandwidth on edge chips.
Combining this approach with quantization beyond one bit might produce further size and speed gains without new training tricks.
If overthinking is the main accuracy limiter, similar early-exit logic might help non-binarized large models on long sequences.

Load-bearing premise

Fractional reduction in entropy among intermediate blocks gives a reliable, general signal for when to exit early that improves accuracy without needing task-specific tuning or dataset bias.
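To show what such an exit signal might look like in practice, here is a minimal sketch. It assumes each block has a classifier head producing logits, and that inference stops once the fractional drop in entropy between consecutive blocks falls below a threshold; the direction of that comparison, the function names, and the example logits are assumptions for illustration, not details confirmed by this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def exit_block(per_block_logits, threshold):
    """Return the 1-based index of the block at which inference stops."""
    prev = None
    for i, logits in enumerate(per_block_logits, start=1):
        h = entropy(softmax(logits))
        # Skip the check at the first block and guard against division by zero.
        if prev is not None and prev > 0.0:
            fractional_drop = (prev - h) / prev
            if fractional_drop < threshold:
                return i  # entropy has stopped falling meaningfully
        prev = h
    return len(per_block_logits)  # no early exit: use the final block


if __name__ == "__main__":
    # Hypothetical logits from four successive blocks for one input.
    trace = [[0.0, 0.0], [2.0, 0.0], [2.1, 0.0], [5.0, 0.0]]
    print(exit_block(trace, threshold=0.1))
```

The threshold is the free parameter discussed elsewhere on this page: in this sketch a larger value makes exits more aggressive, while a value of zero only exits when entropy rises from one block to the next.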

What would settle it

Run the trained BEExformer on a fresh NLP dataset outside the original nine and measure whether accuracy still rises by roughly 3 percent while FLOPs drop by half, with no extra hyperparameter search.

Figures

Figures reproduced from arXiv: 2412.05225 by Amlan Chakrabarti, Saptarsi Goswami, Wazib Ansar.

**Figure 2.** Pareto front charts comparing BEExformer with related works as well as its ablations. For a justified …

**Figure 3.** Distribution of exits and total number of parameters saved from computation during inference upon all …

**Figure 4.** Comparison of the quantized models for varying …

**Figure 5.** Comparison of plots for sign(r), clip(−1, r, 1), and the proposed binarization function b(r) along with their derivatives. The shaded areas portray the difference between the two functions.

**Figure 6.** Comparison of performance of the proposed binarization function …

Original abstract

Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce Binarized Early Exit Transformer (BEExformer), a first-of-its-kind selective learning-based transformer integrating Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT employs a differentiable second-order approximation to the sign function, enabling gradient computation that captures both the sign and magnitude of the weights. This aids in 21.30 times reduction in model size. The EE mechanism hinges on fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.27% and even improves accuracy by 3.22% by resolving the "overthinking" problem inherent in deep networks. Extensive evaluation through comparison with the SOTA methods and various ablations across nine datasets covering multiple NLP tasks demonstrates its Pareto-optimal performance-efficiency trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines binarization-aware training (using a second-order sign approximation), a per-block SLFN, and entropy-drop early exits, claiming large size and FLOP cuts plus an accuracy lift on nine NLP datasets.

The paper's main contribution is putting binarization and early exits together in one transformer architecture. It uses a second-order differentiable approximation to the sign function for training, adds a Selective-Learn Forget Network inside each block, and triggers early exits on fractional entropy reduction with a soft-routing loss. This setup is presented as new and is tested with SOTA comparisons plus ablations across nine datasets covering multiple tasks. The reported outcomes are a 21x model size cut, 52% fewer FLOPs, and a 3.22% accuracy gain by cutting overthinking. The ablations are a plus because they try to isolate the pieces. The work targets people who need smaller, faster transformers for text on constrained hardware like mobile or IoT devices.

The accuracy improvement stands out as the part that needs the most checking. Binarized models normally lose ground, so a net gain requires tight controls, multiple runs, and clear evidence that the entropy signal works without heavy per-dataset tuning. The entropy fraction threshold is a free parameter, and it is not obvious how sensitive the results are to its choice or to training details. The abstract alone leaves room for questions about whether the comparisons are fully fair or if any post-hoc decisions affected the numbers.

Still, the claims are specific enough and the evaluation broad enough that the paper should go to peer review. Reviewers can verify the experiments and see whether the efficiency and accuracy numbers hold up under closer inspection.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BEExformer, a binarized transformer integrating Binarization-Aware Training (BAT) via a differentiable second-order sign approximation with an early-exit (EE) mechanism based on fractional entropy reduction among intermediate blocks, using a Selective-Learn Forget Network (SLFN) and soft-routing loss. The central claims are a 21.30× model size reduction, 52.27% FLOP reduction at inference, and 3.22% accuracy improvement on nine NLP datasets by resolving overthinking, with Pareto-optimal efficiency-accuracy trade-offs versus SOTA methods.

Significance. If the performance numbers prove reproducible with full methodological transparency, the work would be significant for efficient deployment of transformers on constrained hardware. The joint BAT+EE design, with SLFN for contextual retention and a second-order approximation enabling gradient flow through binarization, offers a concrete technical path to simultaneous compression and dynamic computation; the reported accuracy gain alongside efficiency improvements is a notable strength if it generalizes beyond the evaluated sets.

major comments (2)

[Abstract] The central performance claims (21.30× size reduction, 52.27% FLOP reduction, 3.22% accuracy gain) rest on SOTA comparisons and ablations across nine datasets, yet the manuscript provides no full methods, exclusion rules, error bars, or training details, preventing verification of whether the numbers reflect post-hoc choices or robust gains.
[Abstract, EE mechanism] The fractional entropy reduction is asserted to be a reliable, general signal for early exiting that improves accuracy without task-specific post-hoc tuning, but the entropy reduction fraction threshold is an explicit free parameter; this undermines the claim of dataset-independent behavior across the nine evaluation sets.

minor comments (2)

The abstract refers to 'various ablations' without enumerating the ablated components or reporting their quantitative effects in the provided text.
Acronyms SLFN and BAT are introduced without immediate equation-level definitions, which reduces clarity for readers unfamiliar with the selective-learning and binarization-aware components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting reproducibility and generality concerns. We address each major comment below and commit to revisions that strengthen transparency without altering the core contributions.

Referee: [Abstract] The central performance claims (21.30× size reduction, 52.27% FLOP reduction, 3.22% accuracy gain) rest on SOTA comparisons and ablations across nine datasets, yet the manuscript provides no full methods, exclusion rules, error bars, or training details, preventing verification of whether the numbers reflect post-hoc choices or robust gains.

Authors: We acknowledge the need for greater methodological transparency. While the manuscript details the second-order sign approximation (Section 3.1), SLFN architecture (Section 3.2), entropy-based EE with soft-routing loss (Section 3.3), and experimental protocol (Section 4), we agree that error bars from multiple random seeds, complete hyperparameter tables, and explicit dataset selection/exclusion criteria are currently insufficient. In the revision we will add these elements, including standard deviations across runs and a dedicated reproducibility subsection, to allow independent verification of the reported gains.
Revision: yes

Referee: [Abstract, EE mechanism] The fractional entropy reduction is asserted to be a reliable, general signal for early exiting that improves accuracy without task-specific post-hoc tuning, but the entropy reduction fraction threshold is an explicit free parameter; this undermines the claim of dataset-independent behavior across the nine evaluation sets.

Authors: The fractional entropy reduction threshold is a single fixed hyperparameter applied uniformly across all nine datasets; it was selected once on a validation split and not retuned per task or dataset. This design choice underpins the claim of dataset-independent behavior. We will revise the manuscript to state the exact fixed threshold value, include a sensitivity study showing performance stability around that value, and clarify that no per-dataset post-hoc adjustment was performed.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical architecture (BAT with second-order sign approximation integrated with entropy-reduction EE via SLFN and soft-routing loss) whose claimed gains in size, FLOPs, and accuracy are obtained from direct experimental comparisons against SOTA baselines across nine external NLP datasets. No derivation chain reduces a reported prediction or uniqueness result to a fitted parameter or self-citation by construction; the entropy signal is introduced as a design choice and validated experimentally rather than defined in terms of the target metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims rest on newly introduced components (SLFN, BAT approximation) and an entropy-based exit rule whose reliability is asserted via experiments rather than external benchmarks or formal derivation.

free parameters (1)

entropy reduction fraction threshold
Used to trigger early exits; value must be chosen or tuned on validation data.

axioms (1)

domain assumption: A differentiable second-order approximation to the sign function permits gradient flow that captures both sign and magnitude during binarized training.
Invoked to justify BAT enabling effective parameter updates.

invented entities (2)

Selective-Learn Forget Network (SLFN) · no independent evidence
purpose: Enhance contextual retention while eliminating irrelevant information inside each transformer block
New module introduced per block; no independent evidence outside the reported runs.
Binarization-Aware Training (BAT) with second-order sign approximation · no independent evidence
purpose: Enable training of binarized weights by providing usable gradients
Specific training procedure proposed in the paper; no external validation cited.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1] Q. Li et al., “A survey on text classification: From traditional to deep learning”, ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 2, pp. 1–41, 2022.
[2] L. Gu, W. Zhang, Z. Wang, D. Zeng, and H. Jin, “Service management and energy scheduling toward low-carbon edge computing”, IEEE Transactions on Sustainable Computing, vol. 8, no. 1, pp. 109–119, 2022.
[3] S. M. Hasan, T. Islam, M. Saifuzzaman, K. R. Ahmed, C.-H. Huang, and A. R. Shahid, “Carbon emission quantification of machine learning: A review”, IEEE Transactions on Sustainable Computing, 2025.
[4] M. Alswaitti, R. Verdecchia, G. Danoy, P. Bouvry, and J. Pecero, “Training green AI models using elite samples”, IEEE Transactions on Sustainable Computing, 2025.
[5] F. Lagunas, E. Charlaix, V. Sanh, and A. M. Rush, “Block Pruning For Faster Transformers”, in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 10619–10629.
[6] V. Sanh, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, arXiv preprint arXiv:1910.01108, 2019.
[7] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity”, arXiv preprint arXiv:2006.04768, 2020.
[8] S. Mangrulkar, A. Ms, and V. Sembium, “Be3r: Bert based early-exit using expert routing”, in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3504–3512.
[9] Y. Kaya, S. Hong, and T. Dumitras, “Shallow-deep networks: Understanding and mitigating network overthinking”, in International conference on machine learning, 2019, pp. 3301–3310.
[10] J. Xin, R. Nogueira, Y. Yu, and J. Lin, “Early exiting BERT for efficient document ranking”, in Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, 2020, pp. 83–88.
[11] S. Shen et al., “Q-bert: Hessian based ultra low precision quantization of bert”, in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 8815–8821.
[12] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers”, Advances in Neural Information Processing Systems, vol. 35, pp. 27168–27183, 2022.
[13] Z. Liu, W. Luo, B. Wu, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Binarizing deep network towards real-network performance”, International Journal of Computer Vision, vol. 128, pp. 202–219, 2020.
[14] M. W. U. Rahman et al., “Quantized transformer language model implementations on edge devices”, in 2023 International Conference on Machine Learning and Applications (ICMLA), 2023, pp. 709–716.
[15] B. Rokh, A. Azarpeyvand, and A. Khanteymoori, “A comprehensive survey on model quantization for deep neural networks in image classification”, ACM Transactions on Intelligent Systems and Technology, vol. 14, no. 6, pp. 1–50, 2023.
[16] H. Shen, N. Mellempudi, X. He, Q. Gao, C. Wang, and M. Wang, “Efficient post-training quantization with fp8 formats”, Proceedings of Machine Learning and Systems, vol. 6, pp. 483–498, 2024.
[17] G. Shomron, F. Gabbay, S. Kurzum, and U. Weiser, “Post-training sparsity-aware quantization”, Advances in Neural Information Processing Systems, vol. 34, pp. 17737–17748, 2021.
[18] M. Chen et al., “Efficientqat: Efficient quantization-aware training for large language models”, arXiv preprint arXiv:2407.11062, 2024.
[19] H. Qin et al., “Bibench: Benchmarking and analyzing network binarization”, in International Conference on Machine Learning, 2023, pp. 28351–28388.
[20] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks”, Advances in neural information processing systems, vol. 29, 2016.
[21] H. Bai et al., “BinaryBERT: Pushing the Limit of BERT Quantization”, in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4334–4348.
[22] H. Qin et al., “BiBERT: Accurate Fully Binarized BERT”, in International Conference on Learning Representations.
[23] Z. Liu et al., “Bit: Robustly binarized multi-distilled transformer”, Advances in neural information processing systems, vol. 35, pp. 14303–14316, 2022.
[24] S. Stanton, P. Izmailov, P. Kirichenko, A. A. Alemi, and A. G. Wilson, “Does knowledge distillation really work?”, Advances in Neural Information Processing Systems, vol. 34, pp. 6906–6919, 2021.
[25] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, “Dynamic neural networks: A survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7436–7456, 2021.
[26] B. Fu, F. Chen, P. Li, and D. Zeng, “Serving Transformer Models via Joint Request Scheduling and Batching in the Network Edge”, IEEE Transactions on Sustainable Computing, 2025.
[27] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2246–2251.
[28] X. Liu et al., “Towards Efficient NLP: A Standard Evaluation and A Strong Baseline”, in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 3288–3303.
[29] W. Ansar, S. Goswami, A. Chakrabarti, and B. Chakraborty, “A novel selective learning based transformer encoder architecture with enhanced word representation”, Applied Intelligence, vol. 53, no. 8, pp. 9424–9443, 2023.
[30] W. Zhang et al., “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 509–521.
[31] A. Fan, E. Grave, and A. Joulin, “Reducing transformer depth on demand with structured dropout”, arXiv preprint arXiv:1909.11556, 2019.
[32] P. Michel, O. Levy, and G. Neubig, “Are sixteen heads really better than one?”, Advances in neural information processing systems, vol. 32, 2019.
[33] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, “Bert loses patience: Fast and robust inference with early exit”, Advances in Neural Information Processing Systems, vol. 33, pp. 18330–18341, 2020.
[34] A. Vaswani, “Attention is all you need”, Advances in Neural Information Processing Systems, 2017.
[35] A. Conneau and D. Kiela, “Senteval: An evaluation toolkit for universal sentence representations”, arXiv preprint arXiv:1803.05449, 2018.
[36] A. Warstadt, “Neural Network Acceptability Judgments”, arXiv preprint arXiv:1805.12471, 2019.
[37] B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases”, in Third international workshop on paraphrasing (IWP2005), 2005.
[38] A. Wang, “Glue: A multi-task benchmark and analysis platform for natural language understanding”, arXiv preprint arXiv:1804.07461, 2018.
[39] A. Williams, N. Nangia, and S. R. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference”, arXiv preprint arXiv:1704.05426, 2017.
[40] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding”, in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.
[41] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations”, arXiv preprint arXiv:1909.11942, 2019.
[42] Y. Liu et al., “Roberta: A robustly optimized bert pretraining approach”, arXiv preprint arXiv:1907.11692, 2019.

[1] [1]

A survey on text classification: From traditional to deep learning

Q. Li et al., “A survey on text classification: From traditional to deep learning”, ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 2, pp. 1–41, 2022

work page 2022

[2] [2]

Service management and energy scheduling toward low-carbon edge computing,

L. Gu, W. Zhang, Z. Wang, D. Zeng, and H. Jin, "Service management and energy scheduling toward low-carbon edge computing," IEEE Transactions on Sustainable Computing, vol. 8, no. 1, pp. 109–119, 2022

work page 2022

[3] [3]

Carbon emission quantification of machine learning: A review,

S. M. Hasan, T. Islam, M. Saifuzzaman, K. R. Ahmed, C.-H. Huang, and A. R. Shahid, "Carbon emission quantification of machine learning: A review," IEEE Transactions on Sustainable Computing, 2025

work page 2025

[4] [4]

Training green AI models using elite samples,

M. Alswaitti, R. Verdecchia, G. Danoy, P. Bouvry, and J. Pecero, "Training green AI models using elite samples," IEEE Transactions on Sustainable Computing, 2025

work page 2025

[5] [5]

Block Pruning For Faster Transformers

F. Lagunas, E. Charlaix, V . Sanh, and A. M. Rush, “Block Pruning For Faster Transformers”, in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 10619–10629

work page 2021

[6] [6]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

V . Sanh, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, arXiv preprint arXiv:1910. 01108, 2019

work page 1910

[7] [7]

Linformer: Self-attention with linear complexity

S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity”, arXiv preprint arXiv:2006. 04768, 2020

work page 2006

[8] [8]

Be3r: Bert based early-exit using expert routing

S. Mangrulkar, A. Ms, and V . Sembium, “Be3r: Bert based early-exit using expert routing”, in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3504–3512

work page 2022

[9] [9]

Shallow-deep networks: Understanding and mitigating network overthinking

Y . Kaya, S. Hong, and T. Dumitras, “Shallow-deep networks: Understanding and mitigating network overthinking”, in International conference on machine learning, 2019, pp. 3301–3310

work page 2019

[10] [10]

Early exiting BERT for efficient document ranking

J. Xin, R. Nogueira, Y . Yu, and J. Lin, “Early exiting BERT for efficient document ranking”, in Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, 2020, pp. 83–88

work page 2020

[11] [11]

Q-bert: Hessian based ultra low precision quantization of bert

S. Shen et al., “Q-bert: Hessian based ultra low precision quantization of bert”, in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 8815–8821

work page 2020

[12] [12]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers

Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y . He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers”, Advances in Neural Information Processing Systems, vol. 35, pp. 27168–27183, 2022

work page 2022

[13] [13]

Bi-real net: Binarizing deep network towards real-network performance

Z. Liu, W. Luo, B. Wu, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Binarizing deep network towards real-network performance”, International Journal of Computer Vision, vol. 128, pp. 202–219, 2020. 16 BEExformer: A Fast Inferencing Binarized Transformer with Early Exits

work page 2020

[14] [14]

Quantized transformer language model implementations on edge devices

M. W. U. Rahman et al., “Quantized transformer language model implementations on edge devices”, in 2023 International Conference on Machine Learning and Applications (ICMLA), 2023, pp. 709–716

work page 2023

[15] [15]

A comprehensive survey on model quantization for deep neural networks in image classification

B. Rokh, A. Azarpeyvand, and A. Khanteymoori, “A comprehensive survey on model quantization for deep neural networks in image classification”, ACM Transactions on Intelligent Systems and Technology, vol. 14, no. 6, pp. 1–50, 2023

work page 2023

[16] [16]

Efficient post-training quantization with fp8 formats

H. Shen, N. Mellempudi, X. He, Q. Gao, C. Wang, and M. Wang, “Efficient post-training quantization with fp8 formats”, Proceedings of Machine Learning and Systems, vol. 6, pp. 483–498, 2024

work page 2024

[17] [17]

Post-training sparsity-aware quantization

G. Shomron, F. Gabbay, S. Kurzum, and U. Weiser, “Post-training sparsity-aware quantization”, Advances in Neural Information Processing Systems, vol. 34, pp. 17737–17748, 2021

work page 2021

[18] [18]

Efficientqat: Efficient quantization-aware training for large language models

M. Chen et al., “Efficientqat: Efficient quantization-aware training for large language models”, arXiv preprint arXiv:2407. 11062, 2024

work page 2024

[19] [19]

Bibench: Benchmarking and analyzing network binarization

H. Qin et al., “Bibench: Benchmarking and analyzing network binarization”, in International Conference on Machine Learning, 2023, pp. 28351–28388

work page 2023

[20] [20]

Binarized neural networks

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y . Bengio, “Binarized neural networks”, Advances in neural information processing systems, vol. 29, 2016

work page 2016

[21] H. Bai et al., “BinaryBERT: Pushing the Limit of BERT Quantization”, in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4334–4348.

[22] H. Qin et al., “BiBERT: Accurate Fully Binarized BERT”, in International Conference on Learning Representations.

[23] Z. Liu et al., “Bit: Robustly binarized multi-distilled transformer”, Advances in Neural Information Processing Systems, vol. 35, pp. 14303–14316, 2022.

[24] S. Stanton, P. Izmailov, P. Kirichenko, A. A. Alemi, and A. G. Wilson, “Does knowledge distillation really work?”, Advances in Neural Information Processing Systems, vol. 34, pp. 6906–6919, 2021.

[25] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, “Dynamic neural networks: A survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7436–7456, 2021.

[26] B. Fu, F. Chen, P. Li, and D. Zeng, “Serving Transformer Models via Joint Request Scheduling and Batching in the Network Edge”, IEEE Transactions on Sustainable Computing, 2025.

[27] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2246–2251.

[28] X. Liu et al., “Towards Efficient NLP: A Standard Evaluation and A Strong Baseline”, in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 3288–3303.

[29] W. Ansar, S. Goswami, A. Chakrabarti, and B. Chakraborty, “A novel selective learning based transformer encoder architecture with enhanced word representation”, Applied Intelligence, vol. 53, no. 8, pp. 9424–9443, 2023.

[30] W. Zhang et al., “TernaryBERT: Distillation-aware Ultra-low Bit BERT”, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 509–521.

[31] A. Fan, E. Grave, and A. Joulin, “Reducing transformer depth on demand with structured dropout”, arXiv preprint arXiv:1909.11556, 2019.

[32] P. Michel, O. Levy, and G. Neubig, “Are sixteen heads really better than one?”, Advances in Neural Information Processing Systems, vol. 32, 2019.

[33] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, “Bert loses patience: Fast and robust inference with early exit”, Advances in Neural Information Processing Systems, vol. 33, pp. 18330–18341, 2020.

[34] A. Vaswani et al., “Attention is all you need”, Advances in Neural Information Processing Systems, 2017.

[35] A. Conneau and D. Kiela, “Senteval: An evaluation toolkit for universal sentence representations”, arXiv preprint arXiv:1803.05449, 2018.

[36] A. Warstadt et al., “Neural Network Acceptability Judgments”, arXiv preprint arXiv:1805.12471, 2019.

[37] B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases”, in Third International Workshop on Paraphrasing (IWP2005), 2005.

[38] A. Wang et al., “Glue: A multi-task benchmark and analysis platform for natural language understanding”, arXiv preprint arXiv:1804.07461, 2018.

[39] A. Williams, N. Nangia, and S. R. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference”, arXiv preprint arXiv:1704.05426, 2017.

[40] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.

[41] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations”, arXiv preprint arXiv:1909.11942, 2019.

[42] Y. Liu et al., “Roberta: A robustly optimized bert pretraining approach”, arXiv preprint arXiv:1907.11692, 2019.