TOAST: Transformer Optimization using Adaptive and Simple Transformations

Bastian Rieck; Emanuele Palumbo; Emanuele Rodolà; Irene Cannistraci; Julia E. Vogt; Simone Antonelli; Thomas M. Sutter

arxiv: 2410.04941 · v7 · pith:4WWB5O76new · submitted 2024-10-07 · 💻 cs.LG · cs.AI

Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt

Pith reviewed 2026-05-23 19:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: transformer efficiency, model compression, representation similarity, closed-form approximation, vision transformers, parameter reduction, no retraining, TOAST

The pith

Large portions of transformer depth can be replaced by trivial functions like linear maps or the identity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Foundation models are computationally expensive due to their size. The paper demonstrates that many transformer blocks produce outputs similar enough to simple linear transformations or even the identity function that they can be swapped out. TOAST performs this replacement using closed-form mappings chosen adaptively, requiring no retraining. This leads to smaller models that run faster on vision tasks from small datasets to ImageNet while maintaining accuracy. A sympathetic reader would care because it points to a way to make powerful models more accessible and sustainable.

Core claim

TOAST is a framework that uses intra-network representation similarities to approximate entire transformer blocks with lightweight closed-form mappings such as linear transformations or the identity function. Applied to pretrained vision models including ViT, DINOv2, and DeiT, it reduces parameters and computation across datasets from MNIST to ImageNet-1k while preserving or improving downstream performance, without any additional training.

What carries the argument

Adaptive replacement of transformer blocks by simple closed-form functions (linear or identity) selected based on representation similarities.
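
The replacement idea can be illustrated with a minimal sketch. It assumes the linear map is fit by ordinary least squares on paired activations, which is one plausible reading of "closed-form" and not a confirmed description of the paper's procedure. The function names, the synthetic data, and the toy "block" are all hypothetical.

```python
import numpy as np

def fit_linear_map(x_s, x_e):
    """Least-squares fit of an affine map so that x_e ≈ x_s @ W + b."""
    # Append a constant column so the bias is estimated jointly with W.
    ones = np.ones((x_s.shape[0], 1))
    x_aug = np.hstack([x_s, ones])
    # lstsq minimizes ||x_aug @ P - x_e||_F; P stacks W on top of b.
    params, *_ = np.linalg.lstsq(x_aug, x_e, rcond=None)
    return params[:-1], params[-1]

def relative_error(pred, target):
    """Frobenius-norm error of pred relative to the size of target."""
    return np.linalg.norm(pred - target) / np.linalg.norm(target)

# Synthetic stand-in for activations (assumption: 500 samples, width 16).
rng = np.random.default_rng(0)
n, d = 500, 16
x_s = rng.normal(size=(n, d))

# Hypothetical "block": close to linear, with a small nonlinear term.
true_w = np.eye(d) + 0.1 * rng.normal(size=(d, d))
x_e = x_s @ true_w + 0.05 * np.tanh(x_s)

W, b = fit_linear_map(x_s, x_e)
err_linear = relative_error(x_s @ W + b, x_e)   # fitted linear replacement
err_identity = relative_error(x_s, x_e)         # identity replacement
print(f"linear: {err_linear:.4f}  identity: {err_identity:.4f}")
```

On the fitting data the linear map can never do worse than the identity, because the identity is one of the maps the least-squares fit can choose. Whether that advantage carries over to downstream accuracy is the empirical question the paper addresses.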

If this is right

Model size decreases substantially by removing or simplifying multiple blocks.
Computational cost during inference drops due to fewer operations.
Downstream task performance remains comparable or better on tested vision datasets.
No retraining or fine-tuning is needed for the optimization.
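
To give a sense of scale for the size reduction, the sketch below counts the parameters of one standard ViT encoder block and of a dense linear map with bias at the same width. It assumes the usual block layout (multi-head self-attention with biased projections, an MLP with expansion ratio 4, two LayerNorms); variants such as DINOv2 add a few extra per-block parameters that are ignored here. The function names are illustrative.

```python
def vit_block_params(d, mlp_ratio=4):
    """Parameter count of one standard ViT encoder block of width d."""
    # Attention: fused qkv projection (3 matrices) plus output projection.
    attn = 3 * (d * d + d) + (d * d + d)
    # MLP: d -> mlp_ratio*d -> d, both layers with bias.
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d
    return attn + mlp + norms

def linear_map_params(d):
    """Parameter count of a dense d x d map with bias."""
    return d * d + d

d = 768  # hidden width of ViT-B
block = vit_block_params(d)
linear = linear_map_params(d)
ratio = block / linear
print(block, linear, round(ratio, 2))  # 7087872 590592 12.0
```

Under these assumptions a linear replacement keeps roughly one twelfth of a block's parameters, and an identity replacement keeps none.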

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar redundancies might exist in other architectures like language transformers, allowing broader application.
Models could be trained with awareness of these approximations to optimize depth from the start.
Dynamic selection of which blocks to approximate could adapt to different inputs or tasks.
Exploring the limits on more challenging benchmarks would clarify how far the replacement can go.

Load-bearing premise

The internal representations in the transformer are similar enough across blocks that a closed-form linear map or identity can substitute for a full block without performance loss.

What would settle it

Observing a significant accuracy drop on ImageNet-1k after applying TOAST to a standard ViT model would falsify the central claim.

Figures

Figures reproduced from arXiv: 2410.04941 by Bastian Rieck, Emanuele Palumbo, Emanuele Rodolà, Irene Cannistraci, Julia E. Vogt, Simone Antonelli, Thomas M. Sutter.

**Figure 1.** Framework Description. Given two latent spaces X(s) and X(e) corresponding to the outputs of blocks s and e for a random subset of 500 training samples, TOAST estimates a lightweight transformation T such that X(e) ≈ T(X(s)). This allows entire transformer blocks to be approximated by simple closed-form mappings (e.g., linear or identity), reducing parameters and computation without retraining. As Neural…

**Figure 2.** Block Similarities. Block-by-block similarities in DiNO-B and DEiT-S models across five datasets: MNIST, F-MNIST, CIFAR-10, CIFAR-100 and ImageNet1k. Each matrix quantifies the CKA between latent representations of different blocks, showing potential blocks for approximation. The matrices reveal that the similarity between blocks is predominantly influenced by the model rather than the specific dataset. A…

**Figure 3.** Approximation vs. Representation Similarity. CKA between the last block representations of the original and the approximated model when approximating the i-th block. Plot labels recovered from the extraction: Original, TOAST, DiNO-B, DEiT-S.

**Figure 4.** PCA Visualization. Final block representations for original and TOAST models on F-MNIST reveal DiNO-B’s stronger reliance on the final block compared to DEiT-S. To quantify the effect of the approximation, we compute the CKA similarity between the final block representations of the original and the TOAST-approximated model for each block k using its preceding block as input. As shown in…

**Figure 5.** Sample Size Ablation. Classification accuracy as a function of the number of training samples used for approximating different layers of DiNO-B and DEiT-S with a linear transformation using ImageNet1k. Accuracy stabilizes after approximately 500 samples. Takeaway: A small number of samples is sufficient to achieve stable and reliable representations when approximating transformer blocks, balancing efficien…

**Figure 6.** Block Similarities. Block-by-block similarities in ViT-T, ViT-S, DiNO-S and ViT-B models across five datasets: MNIST, F-MNIST, CIFAR-10, CIFAR-100 and ImageNet1k. Each matrix quantifies the CKA between latent representations of different blocks, showing potential blocks for approximation. The matrices reveal that the similarity between blocks is predominantly influenced by the model rather than the specifi…

**Figure 7.** Last Block Approximation. PCA visualization of the final layer representations for both the original model and the model with its last block approximated from the preceding one. The representations are generated using the DiNO-S model across four datasets. The plots highlight that the last layer representations in this model are crucial, making it more effective to approximate earlier blocks instead. Note…

**Figure 8.** Last Block Approximation. PCA visualization of the final layer representations for both the original model and the model with its last block approximated by the preceding one. The representations are generated using the DEiT-S model across four datasets. The plots highlight that in this model, the representations in the last layer are redundant and can be effectively approximated, offering potential perfor…

**Figure 9.** Last Block Approximation. PCA visualization of the final layer representations for both the original model and the model with its second block approximated by the preceding one. The representations are generated using the DiNO-S model across four datasets. Note that for CIFAR-100 (bottom right), only the overall structure of the space can be observed, as the 100 classes make it challenging to distinguish l…

**Figure 10.** Last Block Approximation. PCA visualization of the last layer representations for both the original model and the model with its second block approximated using the previous one. The representations are generated using the ViT-S model across four datasets.

**Figure 11.** Last Block Approximation. PCA visualization of the last layer representations for both the original model and the model with its last block approximated from the previous one. The representations are generated using the ViT-S model across four datasets.

**Figure 12.** Accuracy-efficiency trade-off for different approximation strategies. Each subplot shows the accuracy against a different efficiency metric: the number of parameters (left), GFLOPs (center), and inference throughput (right). The image shows that the linear translator achieves a superior accuracy-efficiency trade-off.

**Figure 13.** Per-class accuracy delta on CIFAR-10 when a single block is approximated in ViT-S, DiNO-S and DEiT-S. Cell values indicate the relative change in the accuracy with respect to the original model. Brighter (green) cells indicate an accuracy gain for the class, while darker (blue) cells indicate an accuracy drop.

**Figure 14.** Normalized relative confusion matrix when single blocks are approximated for DEiT-S on CIFAR-100C. Diagonal cells capture the per-class change in accuracy, whereas off-diagonal cells capture changes in misclassifications between classes. Red (positive) values on the diagonal mean the approximation improves that class’s accuracy. Red off-diagonal values mean more misclassifications. Conversely, blue (negat…

**Figure 15.** Visualization of misclassified samples after approximating a block of ViT-S on CIFAR-10. Images from CIFAR-10 whose label flips from correct to incorrect when specific blocks are approximated. The title reports the true class followed by the wrong prediction.
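
Several captions above rely on CKA to compare block representations. As a rough illustration, the sketch below implements the linear variant of CKA on centered features and builds a block-by-block similarity matrix. Whether the paper uses the linear or a kernel variant is not stated on this page, so the choice here is an assumption, and the data is synthetic.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) representations."""
    # Center each feature so that the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

def cka_matrix(reps):
    """Symmetric matrix of pairwise linear CKA over a list of representations."""
    k = len(reps)
    out = np.ones((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            out[i, j] = out[j, i] = linear_cka(reps[i], reps[j])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16))
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix

same = linear_cka(x, x)
rotated = linear_cka(x, 2.0 * x @ q)            # rotation plus rescaling
unrelated = linear_cka(x, rng.normal(size=(500, 16)))
sims = cka_matrix([x, 2.0 * x @ q, rng.normal(size=(500, 16))])
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `same` and `rotated` both come out at 1 up to floating-point error, while independent random features score far lower.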

Original abstract

Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or finetuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited for enabling techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source for efficiency gains. In this paper, we introduce Transformer Optimization using Adaptive and Simple Transformations (TOAST), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or even the identity function, without any additional training. Across state-of-the-art pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TOAST replaces some transformer blocks with closed-form linear or identity maps without retraining, but the abstract gives no details on how maps are fit or blocks chosen, so the central claim is hard to judge.

The main takeaway is that TOAST swaps entire transformer blocks for simple closed-form replacements like linear maps or the identity function inside a single pretrained model, with no extra training, and reports that performance holds or improves on vision tasks from MNIST to ImageNet across ViT, DINOv2, and DeiT. This is a direct application of intra-network representation similarities to cut depth and compute. It extends prior stitching and merging work by staying inside one network rather than across models, which is a reasonable next step even if the underlying similarity observations are not new. The experiments cover multiple models and dataset scales, which is a positive point if the numbers are reported with proper controls.

The soft spot is exactly what the stress-test note flags: the abstract supplies zero information on the fitting procedure for the linear maps, the rule for picking which blocks to replace, or whether any of this is done on the evaluation data. Without that, it is impossible to know if the replacements are genuinely robust or if block selection and map fitting were tuned after the fact. If the full paper shows pre-specified choices and maps fit on separate data splits, the results become more credible; otherwise the performance preservation could be an artifact.

This work is aimed at people already working on transformer efficiency through representation analysis. A reader interested in practical depth reduction tricks without retraining could get something out of it once the methods are clear. It deserves peer review so the authors can supply the missing procedural details and any statistical checks.

Referee Report

4 major / 3 minor

Summary. The paper introduces TOAST, a framework that exploits intra-network representation similarities in pretrained vision transformers (ViT, DINOv2, DeiT) to replace selected blocks with closed-form lightweight mappings such as linear transformations or the identity function. These replacements are performed without retraining or finetuning. Experiments across datasets from MNIST to ImageNet-1k report that parameter count and computation can be reduced while downstream performance is preserved or improved, supporting the claim that large portions of transformer depth can be replaced by trivial functions.

Significance. If the central empirical claim holds after proper controls, the result would be significant for efficient foundation-model deployment: it would demonstrate that intra-network redundancy can be exploited via simple, training-free substitutions rather than distillation or pruning, and would open a new direction for depth reduction that relies on closed-form mappings instead of learned approximations. The absence of retraining is a notable practical strength.

major comments (4)

[§3.2] Mapping Computation: the procedure for deriving the linear map is not specified. It is unclear whether the least-squares fit is performed on activations from a held-out validation split, the training set, or the evaluation distribution; without this, it is impossible to rule out that block selection or map fitting overfits the reported test metrics.
[§4.2, Table 3] No error bars, standard deviations across seeds, or statistical significance tests are reported for the accuracy deltas. Several entries claim small improvements (e.g., +0.3% on ImageNet) that cannot be distinguished from run-to-run variance, undermining the claim that performance is “preserved or improved.”
[§4.3] Block Selection: the criterion used to decide which blocks are replaced by identity versus linear map versus left unchanged is not described. If selection is performed after measuring downstream accuracy on the test set, the reported compression ratios are post-hoc and the central claim of automatic, similarity-driven replacement is not supported.
[§5] Ablations: there is no control experiment that applies random block replacement or random linear maps of the same rank; without this baseline it is impossible to determine whether the observed preservation of accuracy is due to the intra-network similarity hypothesis or simply to the robustness of the remaining network.

minor comments (3)

[§3.1] Notation for the linear map (e.g., the matrix W and bias b) is introduced without an explicit equation; add Eq. (X) defining the replacement operation.
[Figure 2] The caption does not state the number of models or random seeds used to generate the similarity heatmaps.
[Abstract vs §4.1] The abstract states “across state-of-the-art pretrained vision models” but the experimental section only reports three families; clarify the exact model list in the main text.
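
For orientation, one way the explicit equation requested in the first minor comment could look is sketched below. It reuses the X(s), X(e) and T notation from the Figure 1 caption; the symbols W, b, the all-ones vector and the augmented matrix are assumptions introduced here, and the paper's own definition may differ.

```latex
% Assumed setup: X^{(s)}, X^{(e)} \in \mathbb{R}^{n \times d} hold the outputs of
% blocks s and e for n samples; \mathbf{1}_n is the all-ones vector of length n.
\hat{X}^{(e)} = T\big(X^{(s)}\big) = X^{(s)} W + \mathbf{1}_n b^{\top},
\qquad
(W, b) = \arg\min_{W,\, b}
  \big\lVert X^{(s)} W + \mathbf{1}_n b^{\top} - X^{(e)} \big\rVert_F^2 .

% Closed-form solution with the augmented matrix
% \tilde{X} = [\, X^{(s)} \;\; \mathbf{1}_n \,], valid when
% \tilde{X}^{\top} \tilde{X} is invertible:
\begin{bmatrix} W \\ b^{\top} \end{bmatrix}
  = \big(\tilde{X}^{\top} \tilde{X}\big)^{-1} \tilde{X}^{\top} X^{(e)} .

% The identity replacement is the special case W = I_d, b = 0.
```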

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Referee: [§3.2] Mapping Computation: the procedure for deriving the linear map is not specified. It is unclear whether the least-squares fit is performed on activations from a held-out validation split, the training set, or the evaluation distribution; without this, it is impossible to rule out that block selection or map fitting overfits the reported test metrics.

Authors: We will revise §3.2 to explicitly describe the mapping procedure. The linear maps are obtained via least-squares regression on activations collected from the training set (with no access to validation or test data). This detail was omitted for brevity but will be added along with the precise optimization objective to eliminate any ambiguity regarding data leakage or overfitting.

Revision: yes

Referee: [§4.2, Table 3] No error bars, standard deviations across seeds, or statistical significance tests are reported for the accuracy deltas. Several entries claim small improvements (e.g., +0.3% on ImageNet) that cannot be distinguished from run-to-run variance, undermining the claim that performance is “preserved or improved.”

Authors: We agree that variability measures are important for interpreting small deltas. In the revision we will rerun the key ImageNet experiments across multiple random seeds, report means and standard deviations in Table 3, and add a note on whether the observed changes exceed typical run-to-run variance. The primary claim remains preservation rather than consistent improvement, but the added statistics will allow readers to assess this directly.

Revision: yes

Referee: [§4.3] Block Selection: the criterion used to decide which blocks are replaced by identity versus linear map versus left unchanged is not described. If selection is performed after measuring downstream accuracy on the test set, the reported compression ratios are post-hoc and the central claim of automatic, similarity-driven replacement is not supported.

Authors: Block selection is performed solely on the basis of intra-block representation similarity measured on training-set activations (via reconstruction error or cosine similarity between the original block output and the candidate mapping). No test-set accuracy is used at any stage. We will expand §4.3 to state the exact similarity threshold and decision rule, making the automatic, training-only nature of the procedure explicit.

Revision: yes

Referee: [§5] Ablations: there is no control experiment that applies random block replacement or random linear maps of the same rank; without this baseline it is impossible to determine whether the observed preservation of accuracy is due to the intra-network similarity hypothesis or simply to the robustness of the remaining network.

Authors: We will add the requested control in the revised §5. Specifically, we will report results for random block selection followed by either identity substitution or random linear maps of matching rank. Preliminary checks indicate that such random replacements cause clear accuracy degradation relative to the similarity-driven choices; the new table will quantify this gap and thereby support that the performance retention stems from the identified redundancies.

Revision: yes
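
The two safeguards discussed in this exchange, fitting on training data only and comparing against a random map, can be mocked up on toy data. The sketch below is not the authors' experiment and reproduces none of their results. The toy block, the split sizes, and the norm-matched random baseline are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 500, 200, 16

def toy_block(x, w):
    """Hypothetical stand-in for a transformer block: near-linear in x."""
    return x @ w + 0.05 * np.tanh(x)

w_true = np.eye(d) + 0.1 * rng.normal(size=(d, d))
x_train = rng.normal(size=(n_train, d))
x_test = rng.normal(size=(n_test, d))   # held out, never used for fitting
y_train = toy_block(x_train, w_true)
y_test = toy_block(x_test, w_true)

# Fit the replacement on the training split only.
w_fit, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

# Control: a random dense map rescaled to the same Frobenius norm.
w_rand = rng.normal(size=(d, d))
w_rand *= np.linalg.norm(w_fit) / np.linalg.norm(w_rand)

def held_out_error(w):
    """Relative reconstruction error of the map w on the held-out split."""
    return np.linalg.norm(x_test @ w - y_test) / np.linalg.norm(y_test)

err_fit = held_out_error(w_fit)
err_identity = held_out_error(np.eye(d))
err_rand = held_out_error(w_rand)
print(f"fit: {err_fit:.3f}  identity: {err_identity:.3f}  random: {err_rand:.3f}")
```

In this toy setting the fitted map generalizes to held-out inputs and the random map does not. A real control would report downstream accuracy on the actual models, which this sketch does not attempt.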

Circularity Check

0 steps flagged

No circularity: TOAST presents empirical approximations without self-referential reductions

The abstract and description introduce TOAST as a new framework that exploits observed intra-network representation similarities (from prior literature) to replace blocks with closed-form linear maps or identity functions. No equations, fitting procedures, or derivation steps are described that would reduce a claimed prediction or result to its own inputs by construction. The central contribution is an empirical method and evaluation across ViT/DINOv2/DeiT models on MNIST-to-ImageNet, which stands as independent content rather than a renaming, self-citation load-bearing premise, or fitted parameter presented as a prediction. No self-citation chains or ansatzes are invoked in the provided text to justify uniqueness or force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that representation similarity within one network is sufficient for block replacement.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

[1]

Unified data-free compression: Pruning and quantization without fine-tuning

Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, and Yong Liu. Unified data-free compression: Pruning and quantization without fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 5876--5885, 2023

work page 2023
[2]

Topological data analysis for neural network analysis: A comprehensive survey

Rubén Ballester, Carles Casacuberta, and Sergio Escalera. Topological data analysis for neural network analysis: A comprehensive survey. arXiv preprint arXiv:2312.05840, December 2023

work page arXiv 2023
[3]

Representation topology divergence: A method for comparing neural network representations

Serguei Barannikov, Ilya Trofimov, Nikita Balabin, and Evgeny Burnaev. Representation topology divergence: A method for comparing neural network representations. arXiv preprint arXiv:2201.00058, 2021

work page arXiv 2021
[4]

Bootstrapping parallel anchors for relative representations

Irene Cannistraci, Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, and Emanuele Rodol \` a . Bootstrapping parallel anchors for relative representations. In Krystal Maughan, Rosanne Liu, and Thomas F. Burns (eds.), The First Tiny Papers Track at ICLR 2023, Tiny Papers @ ICLR 2023, Kigali, Rwanda, May 5, 2023 . OpenReview.net, 2023. URL h...

work page 2023
[5]

From bricks to bridges: Product of invariances to enhance latent space communication

Irene Cannistraci, Luca Moschella, Marco Fumero, Valentino Maiorca, and Emanuele Rodol \`a . From bricks to bridges: Product of invariances to enhance latent space communication. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vngVydDWft

work page 2024
[6]

From charts to atlas: Merging latent spaces into one

Donato Crisostomi, Irene Cannistraci, Luca Moschella, Pietro Barbiero, Marco Ciccone, Pietro Lio, and Emanuele Rodol \`a . From charts to atlas: Merging latent spaces into one. In NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2023. URL https://openreview.net/forum?id=ZFu7CPtznY

work page 2023
[7]

Reliability of cka as a similarity measure in deep learning

MohammadReza Davari, Stefan Horoi, Amine Natik, Guillaume Lajoie, Guy Wolf, and Eugene Belilovsky. Reliability of cka as a similarity measure in deep learning. arXiv preprint arXiv:2210.16156, 2022

work page arXiv 2022
[8]

The mnist database of handwritten digit images for machine learning research

Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29 0 (6): 0 141--142, 2012

work page 2012
[9]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, ...

work page 2021
[10]

Latent functional maps: a spectral framework for representation alignment

Marco Fumero, Marco Pegoraro, Valentino Maiorca, Francesco Locatello, and Emanuele Rodol\` a . Latent functional maps: a spectral framework for representation alignment. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp.\ 66178--66203. Curran Associ...

work page 2024
[11]

Relations between two sets of variates

Harold Hotelling. Relations between two sets of variates. Breakthroughs in statistics: methodology and distribution, pp.\ 162--190, 1992

work page 1992
[12]

Similarity of neural network models: A survey of functional and representational measures

Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures. arXiv preprint arXiv:2305.06329, 2023

work page arXiv 2023
[13]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pp.\ 3519--3529. PMLR, 2019

work page 2019
[14]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[16]

Internal representations of vision models through the lens of frames on data manifolds

Henry Kvinge, Grayson Jorgenson, Davis Brown, Charles Godfrey, and Tegan Emerson. Internal representations of vision models through the lens of frames on data manifolds. In NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2022

work page 2023
[17]

On the direct alignment of latent spaces

Zorah L\"ahner and Michael Moeller. On the direct alignment of latent spaces. In Marco Fumero, Emanuele Rodolá, Clementine Domine, Francesco Locatello, Karolina Dziugaite, and Caron Mathilde (eds.), Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, volume 243 of Proceedings of Machine Learning Research, pp.\ 158--169...

work page 2024
[18]

Can unstructured pruning reduce the depth in deep neural networks? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 1402--1406, 2023

Zhu Liao, Victor Qu \'e tu, Van-Tam Nguyen, and Enzo Tartaglione. Can unstructured pruning reduce the depth in deep neural networks? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 1402--1406, 2023

work page 2023
[19]

Llm-pruner: On the structural pruning of large language models

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36: 0 21702--21720, 2023

work page 2023
[20]

Latent space translation via semantic alignment

Valentino Maiorca, Luca Moschella, Antonio Norelli, Marco Fumero, Francesco Locatello, and Emanuele Rodol \`a . Latent space translation via semantic alignment. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[21]

Insights on representational similarity in neural networks with canonical correlation

Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[22]

Relative representations enable zero-shot latent space communication

Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodol \`a . Relative representations enable zero-shot latent space communication. In Proc. ICLR, 2023

work page 2023
[23]

Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth

Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327, 2020

work page arXiv 2010
[24]

Asif: Coupled data turns unimodal models to multimodal without training

Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodola, and Francesco Locatello. Asif: Coupled data turns unimodal models to multimodal without training. Advances in Neural Information Processing Systems, 36: 0 15303--15319, 2023

work page 2023
[25]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021
[27]

Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017

work page 2017
[28]

Berg and Li Fei-Fei , Title =

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge . International Journal of Computer Vision (IJCV), 115 0 (3): 0 211--252, 2015. doi:10.1007/s11263-015-0816-y

work page doi:10.1007/s11263-015-0816-y 2015
[29]

On the effect of dropping layers of pre-trained transformer models

Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers of pre-trained transformer models. Computer Speech & Language, 77: 0 101429, 2023

work page 2023
[30]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models....

work page 2022
[31]

Woodfisher: Efficient second-order approximation for neural network compression

Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 18098--18109. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/202...

work page 2020
[32]

You need multiple exiting: Dynamic early exiting for accelerating unified vision language model

Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, and Dongkuan Xu. You need multiple exiting: Dynamic early exiting for accelerating unified vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781--10791, 2023

[33]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2(3), 2020

[34]

The geometry of hidden representations of large transformer models

Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems, 36, 2024

[35]

Convolutional networks with adaptive inference graphs

Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018

[36]

Residual networks behave like ensembles of relatively shallow networks

Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/37bc2f75...

[37]

Skip-attention: Improving vision transformers by paying less attention

Shashanka Venkataramanan, Amir Ghodrati, Yuki M Asano, Fatih Porikli, and Amir Habibian. Skip-attention: Improving vision transformers by paying less attention. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vI95kcLAoU

[38]

Practical network acceleration with tiny sets

Guo-Hua Wang and Jianxin Wu. Practical network acceleration with tiny sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

[39]

Blockdrop: Dynamic inference paths in residual networks

Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

[40]

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017

[41]

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. Deebert: Dynamic early exiting for accelerating bert inference. arXiv preprint arXiv:2004.12993, 2020

[42]

Width & depth pruning for vision transformers

Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, and Li Cui. Width & depth pruning for vision transformers. In Proc. AAAI, 2022

[43]

A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019

[44]

Dense vision transformer compression with few samples

Hanxiao Zhang, Yifan Zhou, and Guo-Hua Wang. Dense vision transformer compression with few samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15825--15834, June 2024

[45]

Accelerating training of transformer-based language models with progressive layer dropping

Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with progressive layer dropping. Advances in Neural Information Processing Systems, 33:14011--14023, 2020

[46]

Bert loses patience: Fast and robust inference with early exit

Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. Bert loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33:18330--18341, 2020

