arxiv: 2410.04941 · v7 · pith:4WWB5O76new · submitted 2024-10-07 · 💻 cs.LG · cs.AI

TOAST: Transformer Optimization using Adaptive and Simple Transformations

Pith reviewed 2026-05-23 19:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformer efficiency · model compression · representation similarity · closed-form approximation · vision transformers · parameter reduction · no retraining · TOAST

The pith

Large portions of transformer depth can be replaced by trivial functions like linear maps or the identity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Foundation models are computationally expensive due to their size. The paper demonstrates that many transformer blocks produce outputs similar enough to simple linear transformations or even the identity function that they can be swapped out. TOAST performs this replacement using closed-form mappings chosen adaptively, requiring no retraining. This leads to smaller models that run faster on vision tasks from small datasets to ImageNet while maintaining accuracy. A sympathetic reader would care because it points to a way to make powerful models more accessible and sustainable.

Core claim

TOAST is a framework that uses intra-network representation similarities to approximate entire transformer blocks with lightweight closed-form mappings such as linear transformations or the identity function. Applied to pretrained vision models including ViT, DINOv2, and DeiT, it reduces parameters and computation across datasets from MNIST to ImageNet-1k while preserving or improving downstream performance, without any additional training.

What carries the argument

Adaptive replacement of transformer blocks by simple closed-form functions (linear or identity) selected based on representation similarities.
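
A closed-form linear replacement of this kind can be sketched as an ordinary least-squares fit. The sketch below is illustrative only: the function names (`fit_linear_map`, `apply_linear_map`), the synthetic activations, and the joint fitting of a bias term are assumptions, since this page does not reproduce the paper's exact objective. Each row is assumed to be one activation vector.

```python
import numpy as np

def fit_linear_map(x_start, x_end):
    """Least-squares weight and bias so that x_start @ weight + bias approximates x_end."""
    # Append a column of ones so the bias is fitted jointly with the weight.
    ones = np.ones((x_start.shape[0], 1))
    design = np.hstack([x_start, ones])
    solution, *_ = np.linalg.lstsq(design, x_end, rcond=None)
    return solution[:-1], solution[-1]

def apply_linear_map(x, weight, bias):
    """Stand-in for the replaced block(s): a single affine transformation."""
    return x @ weight + bias

# Synthetic stand-ins for activations at block s and block e (hypothetical data).
rng = np.random.default_rng(0)
n, d = 500, 16
x_s = rng.normal(size=(n, d))
true_w = np.eye(d) + 0.1 * rng.normal(size=(d, d))
true_b = rng.normal(size=d)
x_e = x_s @ true_w + true_b + 0.01 * rng.normal(size=(n, d))

weight, bias = fit_linear_map(x_s, x_e)

# Relative reconstruction error of the fitted map versus the identity shortcut.
rel_err_linear = np.linalg.norm(apply_linear_map(x_s, weight, bias) - x_e) / np.linalg.norm(x_e)
rel_err_identity = np.linalg.norm(x_s - x_e) / np.linalg.norm(x_e)
print(f"linear: {rel_err_linear:.4f}  identity: {rel_err_identity:.4f}")
```

On the fitting data the least-squares map can never do worse than the identity, because the identity is one of the candidate affine maps. Whether it generalizes to unseen inputs is an empirical question that the referee comments further down address.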

If this is right

  • Model size decreases substantially by removing or simplifying multiple blocks.
  • Computational cost during inference drops due to fewer operations.
  • Downstream task performance remains comparable or better on tested vision datasets.
  • No retraining or fine-tuning is needed for the optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar redundancies might exist in other architectures like language transformers, allowing broader application.
  • Models could be trained with awareness of these approximations to optimize depth from the start.
  • Dynamic selection of which blocks to approximate could adapt to different inputs or tasks.
  • Exploring the limits on more challenging benchmarks would clarify how far the replacement can go.

Load-bearing premise

The internal representations in the transformer are similar enough across blocks that a closed-form linear map or identity can substitute for a full block without performance loss.
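
The figure captions on this page report block similarity as CKA. A minimal sketch of the linear variant follows, assuming row-wise activation matrices; whether the paper uses the linear or a kernel variant is not stated here, and the function name `linear_cka` and the synthetic inputs are illustrative.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices with one row per sample."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
n, d = 500, 16
x = rng.normal(size=(n, d))

# An orthogonal transform plus isotropic rescaling leaves linear CKA unchanged.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
cka_same = linear_cka(x, x)
cka_rotated = linear_cka(x, 2.0 * (x @ q))

# Independent random activations should score much lower.
cka_unrelated = linear_cka(x, rng.normal(size=(n, d)))
print(cka_same, cka_rotated, cka_unrelated)
```

A high score between the outputs of two blocks only suggests that a simple map might bridge them. It does not by itself guarantee that downstream accuracy survives the replacement.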

What would settle it

Observing a significant accuracy drop on ImageNet-1k after applying TOAST to a standard ViT model would falsify the central claim.

Figures

Figures reproduced from arXiv: 2410.04941 by Bastian Rieck, Emanuele Palumbo, Emanuele Rodolà, Irene Cannistraci, Julia E. Vogt, Simone Antonelli, Thomas M. Sutter.

Figure 1: Framework Description. Given two latent spaces X^(s) and X^(e) corresponding to the outputs of blocks s and e for a random subset of 500 training samples, TOAST estimates a lightweight transformation T such that X^(e) ≈ T(X^(s)). This allows entire transformer blocks to be approximated by simple closed-form mappings (e.g., linear or identity), reducing parameters and computation without retraining. As Neural…
Figure 2: Block Similarities. Block-by-block similarities in DiNO-B and DEiT-S models across five datasets: MNIST, F-MNIST, CIFAR-10, CIFAR-100, and ImageNet1k. Each matrix quantifies the CKA between latent representations of different blocks, showing potential blocks for approximation. The matrices reveal that the similarity between blocks is predominantly influenced by the model rather than the specific dataset. A…
Figure 3: Approximation vs. Representation Similarity. CKA between the last block representations of the original and the approximated model when approximating the i-th block.
Figure 4: PCA Visualization. Final block representations for original and TOAST models on F-MNIST reveal DiNO-B's stronger reliance on the final block compared to DEiT-S. To quantify the effect of the approximation, we compute the CKA similarity between the final block representations of the original and the TOAST-approximated model for each block k using its preceding block as input. As shown in…
Figure 5: Sample Size Ablation. Classification accuracy as a function of the number of training samples used for approximating different layers of DiNO-B and DEiT-S with a linear transformation using ImageNet1k. Accuracy stabilizes after approximately 500 samples. Takeaway: A small number of samples is sufficient to achieve stable and reliable representations when approximating transformer blocks, balancing efficien…
Figure 6: Block Similarities. Block-by-block similarities in ViT-T, ViT-S, DiNO-S, and ViT-B models across five datasets: MNIST, F-MNIST, CIFAR-10, CIFAR-100, and ImageNet1k. Each matrix quantifies the CKA between latent representations of different blocks, showing potential blocks for approximation. The matrices reveal that the similarity between blocks is predominantly influenced by the model rather than the specifi…
Figure 7: Last Block Approximation. PCA visualization of the final layer representations for both the original model and the model with its last block approximated from the preceding one. The representations are generated using the DiNO-S model across four datasets. The plots highlight that the last layer representations in this model are crucial, making it more effective to approximate earlier blocks instead. Note…
Figure 8: Last Block Approximation. PCA visualization of the final layer representations for both the original model and the model with its last block approximated by the preceding one. The representations are generated using the DEiT-S model across four datasets. The plots highlight that in this model, the representations in the last layer are redundant and can be effectively approximated, offering potential perfor…
Figure 9: Last Block Approximation. PCA visualization of the final layer representations for both the original model and the model with its second block approximated by the preceding one. The representations are generated using the DiNO-S model across four datasets. Note that for CIFAR-100 (bottom right), only the overall structure of the space can be observed, as the 100 classes make it challenging to distinguish l…
Figure 10: Last Block Approximation. PCA visualization of the last layer representations for both the original model and the model with its second block approximated using the previous one. The representations are generated using the ViT-S model across four datasets.
Figure 11: Last Block Approximation. PCA visualization of the last layer representations for both the original model and the model with its last block approximated from the previous one. The representations are generated using the ViT-S model across four datasets.
Figure 12: Accuracy-efficiency trade-off for different approximation strategies. Each subplot shows the accuracy against a different efficiency metric: the number of parameters (left), GFLOPs (center), and inference throughput (right). The image shows that the linear translator achieves a superior accuracy-efficiency trade-off. A.2.6 Analysis of Misclassifications: In this section, we examine changes in per-class acc…
Figure 13: Per-class accuracy delta on CIFAR-10 when a single block is approximated in ViT-S, DiNO-S, and DEiT-S. Cell values indicate the relative change in accuracy with respect to the original model. Brighter (green) cells indicate an accuracy gain for the class, while darker (blue) cells indicate an accuracy drop. … delta along the diagonal). On the other hand, approximating the last block acts as a regularizer…
Figure 14: Normalized relative confusion matrix when single blocks are approximated for DEiT-S on CIFAR-100C. Diagonal cells capture the per-class change in accuracy, whereas off-diagonal cells capture changes in misclassifications between classes. Red (positive) values on the diagonal mean the approximation improves that class's accuracy. Red off-diagonal values mean more misclassifications. Conversely, blue (negat…
Figure 15: Visualization of misclassified samples after approximating a block of ViT-S on CIFAR-10. Images from CIFAR-10 whose label flips from correct to incorrect when specific blocks are approximated. The title reports the true class followed by the wrong prediction.
Original abstract

Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or finetuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited for enabling techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source for efficiency gains. In this paper, we introduce Transformer Optimization using Adaptive and Simple Transformations (TOAST), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or even the identity function, without any additional training. Across state-of-the-art pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 3 minor

Summary. The paper introduces TOAST, a framework that exploits intra-network representation similarities in pretrained vision transformers (ViT, DINOv2, DeiT) to replace selected blocks with closed-form lightweight mappings such as linear transformations or the identity function. These replacements are performed without retraining or finetuning. Experiments across datasets from MNIST to ImageNet-1k report that parameter count and computation can be reduced while downstream performance is preserved or improved, supporting the claim that large portions of transformer depth can be replaced by trivial functions.

Significance. If the central empirical claim holds after proper controls, the result would be significant for efficient foundation-model deployment: it would demonstrate that intra-network redundancy can be exploited via simple, training-free substitutions rather than distillation or pruning, and would open a new direction for depth reduction that relies on closed-form mappings instead of learned approximations. The absence of retraining is a notable practical strength.

major comments (4)
  1. [§3.2] §3.2 (Mapping Computation): the procedure for deriving the linear map is not specified. It is unclear whether the least-squares fit is performed on activations from a held-out validation split, the training set, or the evaluation distribution; without this, it is impossible to rule out that block selection or map fitting overfits the reported test metrics.
  2. [§4.2, Table 3] §4.2 and Table 3: no error bars, standard deviations across seeds, or statistical significance tests are reported for the accuracy deltas. Several entries claim small improvements (e.g., +0.3 % on ImageNet) that cannot be distinguished from run-to-run variance, undermining the claim that performance is “preserved or improved.”
  3. [§4.3] §4.3 (Block Selection): the criterion used to decide which blocks are replaced by identity versus linear map versus left unchanged is not described. If selection is performed after measuring downstream accuracy on the test set, the reported compression ratios are post-hoc and the central claim of automatic, similarity-driven replacement is not supported.
  4. [§5] §5 (Ablations): there is no control experiment that applies random block replacement or random linear maps of the same rank; without this baseline it is impossible to determine whether the observed preservation of accuracy is due to the intra-network similarity hypothesis or simply to the robustness of the remaining network.
minor comments (3)
  1. [§3.1] Notation for the linear map (e.g., the matrix W and bias b) is introduced without an explicit equation; add Eq. (X) defining the replacement operation.
  2. [Figure 2] Figure 2 caption does not state the number of models or random seeds used to generate the similarity heatmaps.
  3. [Abstract vs §4.1] The abstract states “across state-of-the-art pretrained vision models” but the experimental section only reports three families; clarify the exact model list in the main text.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Mapping Computation): the procedure for deriving the linear map is not specified. It is unclear whether the least-squares fit is performed on activations from a held-out validation split, the training set, or the evaluation distribution; without this, it is impossible to rule out that block selection or map fitting overfits the reported test metrics.

    Authors: We will revise §3.2 to explicitly describe the mapping procedure. The linear maps are obtained via least-squares regression on activations collected from the training set (with no access to validation or test data). This detail was omitted for brevity but will be added along with the precise optimization objective to eliminate any ambiguity regarding data leakage or overfitting. revision: yes

  2. Referee: [§4.2, Table 3] §4.2 and Table 3: no error bars, standard deviations across seeds, or statistical significance tests are reported for the accuracy deltas. Several entries claim small improvements (e.g., +0.3 % on ImageNet) that cannot be distinguished from run-to-run variance, undermining the claim that performance is “preserved or improved.”

    Authors: We agree that variability measures are important for interpreting small deltas. In the revision we will rerun the key ImageNet experiments across multiple random seeds, report means and standard deviations in Table 3, and add a note on whether the observed changes exceed typical run-to-run variance. The primary claim remains preservation rather than consistent improvement, but the added statistics will allow readers to assess this directly. revision: yes

  3. Referee: [§4.3] §4.3 (Block Selection): the criterion used to decide which blocks are replaced by identity versus linear map versus left unchanged is not described. If selection is performed after measuring downstream accuracy on the test set, the reported compression ratios are post-hoc and the central claim of automatic, similarity-driven replacement is not supported.

    Authors: Block selection is performed solely on the basis of intra-network representation similarity measured on training-set activations (via reconstruction error or cosine similarity between the original block output and the candidate mapping). No test-set accuracy is used at any stage. We will expand §4.3 to state the exact similarity threshold and decision rule, making the automatic, training-only nature of the procedure explicit. revision: yes

  4. Referee: [§5] §5 (Ablations): there is no control experiment that applies random block replacement or random linear maps of the same rank; without this baseline it is impossible to determine whether the observed preservation of accuracy is due to the intra-network similarity hypothesis or simply to the robustness of the remaining network.

    Authors: We will add the requested control in the revised §5. Specifically, we will report results for random block selection followed by either identity substitution or random linear maps of matching rank. Preliminary checks indicate that such random replacements cause clear accuracy degradation relative to the similarity-driven choices; the new table will quantify this gap and thereby support that the performance retention stems from the identified redundancies. revision: yes
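
The control requested in major comment 4 can be illustrated on synthetic data. The following is a minimal sketch, not the authors' experiment: the activations are synthetic, the "block" is a hypothetical near-identity linear map plus noise, and the random baseline is a Gaussian matrix rescaled to the fitted map's Frobenius norm rather than a rank-matched map from the paper. It compares the fitted map, the identity, and the random map on held-out samples.

```python
import numpy as np

def relative_error(pred, target):
    """Frobenius-norm reconstruction error relative to the target's norm."""
    return np.linalg.norm(pred - target) / np.linalg.norm(target)

rng = np.random.default_rng(0)
d = 16
# Hypothetical block behaviour: close to the identity, plus a learned perturbation.
block_w = np.eye(d) + 0.2 * rng.normal(size=(d, d))

def sample_pair(n):
    """Draw synthetic input and output activations for the block."""
    x_s = rng.normal(size=(n, d))
    x_e = x_s @ block_w + 0.05 * rng.normal(size=(n, d))
    return x_s, x_e

x_fit_s, x_fit_e = sample_pair(500)    # used only to fit the map
x_eval_s, x_eval_e = sample_pair(200)  # held out for the comparison

# Similarity-driven candidate: least-squares map fitted on the fitting split.
fitted_w, *_ = np.linalg.lstsq(x_fit_s, x_fit_e, rcond=None)

# Control: random Gaussian matrix rescaled to the fitted map's Frobenius norm.
random_w = rng.normal(size=(d, d))
random_w *= np.linalg.norm(fitted_w) / np.linalg.norm(random_w)

err_fitted = relative_error(x_eval_s @ fitted_w, x_eval_e)
err_identity = relative_error(x_eval_s, x_eval_e)
err_random = relative_error(x_eval_s @ random_w, x_eval_e)
print(f"fitted: {err_fitted:.3f}  identity: {err_identity:.3f}  random: {err_random:.3f}")
```

Reconstruction error on held-out activations is only a proxy. The control the referee asks for would report downstream accuracy for each replacement strategy.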

Circularity Check

0 steps flagged

No circularity: TOAST presents empirical approximations without self-referential reductions

The abstract and description introduce TOAST as a new framework that exploits observed intra-network representation similarities (from prior literature) to replace blocks with closed-form linear maps or identity functions. No equations, fitting procedures, or derivation steps are described that would reduce a claimed prediction or result to its own inputs by construction. The central contribution is an empirical method and evaluation across ViT/DINOv2/DeiT models on MNIST-to-ImageNet, which stands as independent content rather than a renaming, self-citation load-bearing premise, or fitted parameter presented as a prediction. No self-citation chains or ansatzes are invoked in the provided text to justify uniqueness or force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that representation similarity within one network is sufficient for block replacement.

pith-pipeline@v0.9.0 · 5740 in / 1008 out tokens · 23808 ms · 2026-05-23T19:43:02.921248+00:00 · methodology

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [1]

    Unified data-free compression: Pruning and quantization without fine-tuning

    Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, and Yong Liu. Unified data-free compression: Pruning and quantization without fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 5876--5885, 2023

  2. [2]

    Topological data analysis for neural network analysis: A comprehensive survey

    Rubén Ballester, Carles Casacuberta, and Sergio Escalera. Topological data analysis for neural network analysis: A comprehensive survey. arXiv preprint arXiv:2312.05840, December 2023

  3. [3]

    Representation topology divergence: A method for comparing neural network representations

    Serguei Barannikov, Ilya Trofimov, Nikita Balabin, and Evgeny Burnaev. Representation topology divergence: A method for comparing neural network representations. arXiv preprint arXiv:2201.00058, 2021

  4. [4]

    Bootstrapping parallel anchors for relative representations

    Irene Cannistraci, Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, and Emanuele Rodol \` a . Bootstrapping parallel anchors for relative representations. In Krystal Maughan, Rosanne Liu, and Thomas F. Burns (eds.), The First Tiny Papers Track at ICLR 2023, Tiny Papers @ ICLR 2023, Kigali, Rwanda, May 5, 2023 . OpenReview.net, 2023. URL h...

  5. [5]

    From bricks to bridges: Product of invariances to enhance latent space communication

    Irene Cannistraci, Luca Moschella, Marco Fumero, Valentino Maiorca, and Emanuele Rodol \`a . From bricks to bridges: Product of invariances to enhance latent space communication. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vngVydDWft

  6. [6]

    From charts to atlas: Merging latent spaces into one

    Donato Crisostomi, Irene Cannistraci, Luca Moschella, Pietro Barbiero, Marco Ciccone, Pietro Lio, and Emanuele Rodol \`a . From charts to atlas: Merging latent spaces into one. In NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2023. URL https://openreview.net/forum?id=ZFu7CPtznY

  7. [7]

    Reliability of cka as a similarity measure in deep learning

    MohammadReza Davari, Stefan Horoi, Amine Natik, Guillaume Lajoie, Guy Wolf, and Eugene Belilovsky. Reliability of cka as a similarity measure in deep learning. arXiv preprint arXiv:2210.16156, 2022

  8. [8]

    The mnist database of handwritten digit images for machine learning research

    Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29 0 (6): 0 141--142, 2012

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, ...

  10. [10]

    Latent functional maps: a spectral framework for representation alignment

    Marco Fumero, Marco Pegoraro, Valentino Maiorca, Francesco Locatello, and Emanuele Rodol\` a . Latent functional maps: a spectral framework for representation alignment. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp.\ 66178--66203. Curran Associ...

  11. [11]

    Relations between two sets of variates

    Harold Hotelling. Relations between two sets of variates. Breakthroughs in statistics: methodology and distribution, pp.\ 162--190, 1992

  12. [12]

    Similarity of neural network models: A survey of functional and representational measures

    Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures. arXiv preprint arXiv:2305.06329, 2023

  13. [13]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pp.\ 3519--3529. PMLR, 2019

  14. [14]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  15. [16]

    Internal representations of vision models through the lens of frames on data manifolds

    Henry Kvinge, Grayson Jorgenson, Davis Brown, Charles Godfrey, and Tegan Emerson. Internal representations of vision models through the lens of frames on data manifolds. In NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2022

  16. [17]

    On the direct alignment of latent spaces

    Zorah L\"ahner and Michael Moeller. On the direct alignment of latent spaces. In Marco Fumero, Emanuele Rodolá, Clementine Domine, Francesco Locatello, Karolina Dziugaite, and Caron Mathilde (eds.), Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, volume 243 of Proceedings of Machine Learning Research, pp.\ 158--169...

  17. [18]

    Can unstructured pruning reduce the depth in deep neural networks? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 1402--1406, 2023

    Zhu Liao, Victor Qu \'e tu, Van-Tam Nguyen, and Enzo Tartaglione. Can unstructured pruning reduce the depth in deep neural networks? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 1402--1406, 2023

  18. [19]

    Llm-pruner: On the structural pruning of large language models

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36: 0 21702--21720, 2023

  19. [20]

    Latent space translation via semantic alignment

    Valentino Maiorca, Luca Moschella, Antonio Norelli, Marco Fumero, Francesco Locatello, and Emanuele Rodol \`a . Latent space translation via semantic alignment. Advances in Neural Information Processing Systems, 36, 2024

  20. [21]

    Insights on representational similarity in neural networks with canonical correlation

    Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. Advances in Neural Information Processing Systems, 31, 2018

  21. [22]

    Relative representations enable zero-shot latent space communication

    Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodol \`a . Relative representations enable zero-shot latent space communication. In Proc. ICLR, 2023

  22. [23]

    Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth

    Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327, 2020

  23. [24]

    Asif: Coupled data turns unimodal models to multimodal without training

    Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodola, and Francesco Locatello. Asif: Coupled data turns unimodal models to multimodal without training. Advances in Neural Information Processing Systems, 36: 0 15303--15319, 2023

  24. [25]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  25. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  26. [27]

    Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability

    Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017

  27. [28]

    ImageNet Large Scale Visual Recognition Challenge,

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge . International Journal of Computer Vision (IJCV), 115 0 (3): 0 211--252, 2015. doi:10.1007/s11263-015-0816-y

  28. [29]

    On the effect of dropping layers of pre-trained transformer models

    Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers of pre-trained transformer models. Computer Speech & Language, 77: 0 101429, 2023

  29. [30]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models....

  30. [31]

    Woodfisher: Efficient second-order approximation for neural network compression

    Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 18098--18109. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/202...

  31. [32]

    You need multiple exiting: Dynamic early exiting for accelerating unified vision language model

    Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, and Dongkuan Xu. You need multiple exiting: Dynamic early exiting for accelerating unified vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10781--10791, 2023

  32. [33]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv \'e J \'e gou. Training data-efficient image transformers & distillation through attention. arxiv 2020. arXiv preprint arXiv:2012.12877, 2 0 (3), 2020

  33. [34]

    The geometry of hidden representations of large transformer models

    Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems, 36, 2024

  34. [35]

    Convolutional networks with adaptive inference graphs

    Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018

  35. [36]

    Residual networks behave like ensembles of relatively shallow networks

    Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/37bc2f75...

  36. [37]

    Skip-attention: Improving vision transformers by paying less attention

    Shashanka Venkataramanan, Amir Ghodrati, Yuki M Asano, Fatih Porikli, and Amir Habibian. Skip-attention: Improving vision transformers by paying less attention. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vI95kcLAoU

  37. [38]

    Practical network acceleration with tiny sets

    Guo-Hua Wang and Jianxin Wu. Practical network acceleration with tiny sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  38. [39]

    BlockDrop: Dynamic inference paths in residual networks

    Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  39. [40]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017

  40. [41]

    DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

    Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993, 2020

  41. [42]

    Width & depth pruning for vision transformers

    Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, and Li Cui. Width & depth pruning for vision transformers. In Proc. AAAI, 2022

  42. [43]

    A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

    Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019

  43. [44]

    Dense vision transformer compression with few samples

    Hanxiao Zhang, Yifan Zhou, and Guo-Hua Wang. Dense vision transformer compression with few samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15825–15834, June 2024

  44. [45]

    Accelerating training of transformer-based language models with progressive layer dropping

    Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with progressive layer dropping. Advances in Neural Information Processing Systems, 33:14011–14023, 2020

  45. [46]

    BERT loses patience: Fast and robust inference with early exit

    Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. BERT loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33:18330–18341, 2020
