Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Janek Haberer; Jon Eike Wilhelm; Olaf Landsiedel

arxiv: 2605.22677 · v1 · pith:E52PLSPZnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 05:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords: slimmable networks · ConvNeXt · width-adaptive inference · efficient deployment · ImageNet · nested subnetworks · LayerNorm

The pith

ConvNeXt's LayerNorm and inverted bottlenecks let one set of weights train multiple nested widths for accurate, compute-adaptive inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that modern ConvNeXt blocks support width-slimming without the switchable batch-norm layers required by earlier CNNs. Because LayerNorm and inverted bottlenecks remove normalization overhead, a single shared-weight network can be trained from scratch to contain multiple nested subnetworks of increasing capacity. On ImageNet-1k this yields 80.8 percent top-1 accuracy at 4.5 GMACs and 77.4 percent at 1.2 GMACs with only three subnetworks, exceeding HydraViT, MatFormer, and SortedNet at matching compute. The same weights therefore serve devices whose available compute varies because of battery, heat, or latency limits.

Core claim

Slimmable ConvNeXt trains one ConvNeXt model whose weights contain multiple nested subnetworks by jointly optimizing for different channel widths; LayerNorm and inverted bottlenecks remove the normalization overhead that forced switchable batch-norm layers in prior slimmable CNNs, producing a simpler training pipeline and higher accuracy than both earlier CNN and recent Vision Transformer approaches at comparable GMACs.

What carries the argument

Nested channel-width subnetworks inside ConvNeXt blocks, where each width is a strict subset of the next and all widths share the same weights while using LayerNorm in place of batch normalization.
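The nesting can be sketched in a few lines. This is a minimal illustration, assuming weights are stored as plain nested lists and that a width fraction p keeps the first ⌊p·C⌋ channels, as the Figure 2 caption describes. The helper names (`slim_channels`, `slice_linear`, `layer_norm`) and the toy sizes are hypothetical and are not taken from the paper, whose implementation may differ.

```python
import math

def slim_channels(full_channels: int, p: float) -> int:
    """Number of channels kept at width fraction p: floor(p * C)."""
    return math.floor(p * full_channels)

def slice_linear(weight, bias, in_keep, out_keep):
    """Keep the first `out_keep` rows and the first `in_keep` columns of a dense layer."""
    sliced_w = [row[:in_keep] for row in weight[:out_keep]]
    return sliced_w, bias[:out_keep]

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm over the channels that are present.

    The mean and variance are computed from x itself, so no per-width
    running statistics have to be stored (unlike batch normalization).
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

# Toy layer (hypothetical sizes): C = 4 channels, 4x expansion to 16 hidden channels.
C, expansion = 4, 4
weight = [[0.01 * (r + c) for c in range(C)] for r in range(expansion * C)]
bias = [0.0] * (expansion * C)
gamma, beta = [1.0] * C, [0.0] * C

p = 0.5
c_keep = slim_channels(C, p)              # input channels kept
h_keep = slim_channels(expansion * C, p)  # expanded channels kept
w_half, b_half = slice_linear(weight, bias, c_keep, h_keep)

x = [1.0, 3.0, 5.0, 7.0][:c_keep]         # slimmed input activation
normed = layer_norm(x, gamma[:c_keep], beta[:c_keep])
hidden = [sum(w * v for w, v in zip(row, normed)) + b
          for row, b in zip(w_half, b_half)]
```

The narrower layer is a prefix view of the wider one, so the same stored parameters serve both widths; only the slice bounds change.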

If this is right

A single set of weights can be deployed across devices with different compute budgets or can adapt width on a single device as resources fluctuate.
No separate models need to be stored or switched at inference time.
The same training recipe scales from ConvNeXt-T to ConvNeXt-B, raising peak accuracy to 82.8 percent at 15.35 GMACs.
Inference latency can be tuned by selecting the appropriate width at runtime without retraining.
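A runtime selector for such a model could look like the sketch below. It assumes the device can express its current budget in GMACs and uses only the two ConvNeXt-T operating points quoted in the abstract; the labels `reduced` and `full` and the function `pick_operating_point` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPoint:
    name: str     # hypothetical label, not from the paper
    gmacs: float  # compute cost in GMACs
    top1: float   # reported top-1 accuracy in percent

# The two Slimmable ConvNeXt-T points reported in the abstract.
POINTS = [
    OperatingPoint("reduced", 1.2, 77.4),
    OperatingPoint("full", 4.5, 80.8),
]

def pick_operating_point(budget_gmacs, points=POINTS):
    """Return the most accurate point that fits the budget, or None if none fits."""
    feasible = [pt for pt in points if pt.gmacs <= budget_gmacs]
    return max(feasible, key=lambda pt: pt.top1) if feasible else None
```

Because all widths share one set of weights, switching points in this scheme means changing slice bounds rather than loading a different checkpoint.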

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same LayerNorm-enabled slimming technique could be applied to other modern CNN families that have already replaced batch normalization.
Real-device measurements of latency and energy under thermal throttling would test whether the GMAC figures translate to practical gains.
Combining width slimming with post-training quantization or pruning could produce even finer-grained efficiency ladders.

Load-bearing premise

ConvNeXt's LayerNorm and inverted-bottleneck blocks eliminate the normalization overhead that previously required switchable batch-norm layers, allowing joint training of multiple widths without accuracy loss.

What would settle it

Train a fixed-width ConvNeXt model at each target width separately and compare its top-1 accuracy on ImageNet-1k to the corresponding subnetwork extracted from the jointly trained slimmable model; a consistent gap larger than a few percentage points would falsify the claim that the simpler pipeline preserves performance.
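The comparison described above reduces to a per-width gap check. The sketch below is one way to phrase it; the function names are hypothetical, the 3.0 percentage-point threshold is an assumed reading of "a few percentage points", and the accuracy values are placeholders chosen only to exercise the code, not results from the paper.

```python
def accuracy_gaps(standalone, slimmable):
    """Per-width gap (standalone minus slimmable) in percentage points."""
    return {w: round(standalone[w] - slimmable[w], 2) for w in standalone}

def consistent_gap(gaps, threshold_pp):
    """True if every width trails its standalone model by more than the threshold."""
    return all(g > threshold_pp for g in gaps.values())

# Placeholder numbers only; they are NOT measurements from the paper.
standalone = {"quarter": 70.0, "half": 78.0, "full": 82.0}
slimmable = {"quarter": 69.0, "half": 77.5, "full": 81.5}

gaps = accuracy_gaps(standalone, slimmable)
claim_falsified = consistent_gap(gaps, threshold_pp=3.0)  # assumed threshold
```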

Figures

Figures reproduced from arXiv: 2605.22677 by Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel.

**Figure 1.** ConvNeXt-T variants vs. ViT-S-based baselines on ImageNet-1k. Slimmable ConvNeXt-T with 3, 4, and 5 subnetworks compared to HydraViT, SortedNet, DynaBERT, and MatFormer in the low-compute regime. All ConvNeXt-T variants achieve a higher accuracy at significantly lower GMACs.

**Figure 2.** A ConvNeXt block slimmed to p=0.5. All weight tensors are sliced to retain the first ⌊p·C⌋ channels, including the 4× expanded intermediate layer. The residual is zero-padded to match the input dimension.

2. Related Work. Training a single set of shared weights containing multiple nested subnetworks of increasing capacity has a long history in CNNs. Slimmable Networks [33] train one model at multiple pre…

**Figure 3.** The ConvNeXt-Tiny architecture with 4 stages and 18 blocks total. Downsampling layers between stages halve spatial resolution

**Figure 4.** Accuracy vs. GMACs for all Slimmable ConvNeXt variants and ViT-based baselines on ImageNet-1k. Slimmable ConvNeXt models across all three sizes achieve competitive or superior accuracy at lower GMACs compared to the ViT-based methods.

…izes the accuracy-compute trade-off across all model sizes and baselines. Our strongest configuration, Slimmable ConvNeXt-B with 4 subnetworks, achieves 82.8% top-1 accuracy …

**Figure 5.** Scaling comparison of Slimmable ConvNeXt across three ConvNeXt sizes. Each panel shows the accuracy-compute trade-off for a single model size with its subnetwork variants, compared to ViT-based baselines.

… retain 82.5%, matching the 3-subnetwork baseline. This robustness is consistent with the observation above that wider channel dimensions provide sufficient redundancy to absorb the cost of additional sub…

**Figure 6.** Figure 6 compares Slimmable ConvNeXt-T against classical CNN slimming methods: AutoSlim-ResNet-50 [31] and US-MobileNet v1/v2 [32]. The MobileNet-based methods are limited to a narrow GMACs range below 0.6 GMACs due to their lightweight base architectures, and AutoSlim-ResNet-50 reaches at most 76.0% at 3.0 GMACs. In contrast, Slimmable ConvNeXt-T covers 0.3 to 4.5 GMACs and reaches 80.8% at full capacity, dem…

**Figure 7.** AutoSlim vs. uniform slimming on ConvNeXt-T with 3 and 4 subnetworks. AutoSlim provides modest improvements at low capacity for both subnetwork counts, while uniform slimming retains a slight edge at full capacity.

[Plot residue: x-axis GMACs; y-axis Accuracy [%]; series: Slim-ConvNeXt-T and AutoSlim-ConvNeXt-T, each at 300 and 600 epochs.]

**Figure 8.** Training duration and AutoSlim on ConvNeXt-T (3 subnetworks). Standard and AutoSlim models at 300 and 600 epochs. Longer training benefits all operating points, especially at low capacity.

… ConvNeXt-T with 3 subnetworks. Extending training from 300 to 600 epochs improves accuracy for all subnetworks, with the largest gains at low capacity (65.4% to 66.9% at quarter capacity vs. 79.8% to 80.8% at full capaci…

Original abstract

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
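The margins quoted in the abstract can be re-derived from its own numbers. The short check below uses only figures stated above; it assumes that MatFormer-S and SortedNet-S are compared against the 4.5 GMAC operating point, since the abstract says only "at the same GMACs". The variable and function names are illustrative.

```python
# Top-1 accuracies (percent) as quoted in the abstract.
ours_full = 80.8     # Slimmable ConvNeXt-T at 4.5 GMACs
ours_reduced = 77.4  # Slimmable ConvNeXt-T at 1.2 GMACs

# Assumption: MatFormer-S and SortedNet-S are matched against the full-width point.
baselines_full = {"HydraViT 6-head": 78.4, "MatFormer-S": 78.6, "SortedNet-S": 78.2}
baselines_reduced = {"HydraViT 3-head": 73.0}

def margins(ours, baselines):
    """Margin in percentage points, rounded to one decimal like the reported figures."""
    return {name: round(ours - acc, 1) for name, acc in baselines.items()}

full_margins = margins(ours_full, baselines_full)
reduced_margins = margins(ours_reduced, baselines_reduced)
```

Under that assumption the HydraViT margins match the stated 2.4 and 4.4 percentage points, and the MatFormer-S and SortedNet-S margins come out at 2.2 and 2.6 points.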

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConvNeXt's LayerNorm and inverted bottlenecks simplify width slimming without switchable BN, but the 600-epoch schedule may explain part of the reported gains over baselines. The paper shows that these design choices remove the normalization overhead that older slimmable CNNs had to handle with extra layers, leading to a cleaner shared-weight training pipeline for multiple nested widths. They demonstrate this on ConvNeXt-T with three subnetworks, hitting 80.8% top-1 at 4.5 GMACs and 77.4% at 1.2 GMACs on ImageNet-1k, which sits above the HydraViT, MatFormer-S, and SortedNet-S numbers at comparable compute. The base variant scales to 82.8% at the high end.

This is a practical step for anyone who wants one set of weights that can run at different capacities on varying hardware or under changing load conditions. The direct comparisons on a public dataset give a clear picture of where the method lands relative to recent alternatives.

The main soft spot is the training budget. Six hundred epochs from scratch is longer than the 300-epoch standard used for most ConvNeXt and slimmable baselines. If the cited numbers for HydraViT and the others come from their original shorter schedules, some of the 2-4 point margins could come from extra optimization steps rather than the architectural fit alone. The abstract does not clarify whether the baselines were re-run under the same conditions or provide ablations on epoch count, so the causal link between the ConvNeXt choices and the accuracy lift is not fully pinned down. No error bars or variance numbers are mentioned either, which leaves the stability of the subnetwork results a bit open.

This work is aimed at practitioners in efficient vision deployment who need width-adaptive inference without maintaining separate models. Readers working on multi-device or variable-compute scenarios will get a usable recipe and concrete numbers to build from.

The core idea is straightforward and testable on standard benchmarks, so it deserves a serious referee even if the schedule parity and variance details need tightening in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Slimmable ConvNeXt, a width-adaptive CNN that exploits ConvNeXt's LayerNorm and inverted-bottleneck blocks to train a single set of weights containing multiple nested subnetworks without requiring switchable batch-normalization layers. On ImageNet-1k, the Tiny variant with three subnetworks is reported to reach 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs after 600 epochs of training from scratch, exceeding HydraViT (6-head: 78.4% at 4.6 GMACs; 3-head: 73.0% at 1.3 GMACs), MatFormer-S (78.6%), and SortedNet-S (78.2%) at comparable compute; the Base variant further reaches 82.8% at 15.35 GMACs.

Significance. If the reported gains can be shown to stem from the architectural simplifications rather than training-budget differences, the work would provide a practical route to efficient multi-device and dynamic-compute deployment. By demonstrating that a modern CNN backbone supports nested width-adaptive subnetworks with a simpler pipeline than either classical slimmable CNNs or recent ViT-based methods, it reduces engineering overhead for adaptive inference while delivering concrete accuracy-versus-GMACs improvements on a public benchmark.

major comments (1)

[Abstract] Abstract: the central claim that ConvNeXt's LayerNorm and inverted bottlenecks enable superior width-adaptive performance rests on direct comparisons to HydraViT, MatFormer-S, and SortedNet-S. The manuscript states that Slimmable ConvNeXt-T was trained from scratch for 600 epochs, yet does not specify the training schedules used for the cited baselines (standard ConvNeXt and many slimmable/ViT baselines use 300 epochs). This mismatch is load-bearing for the causal attribution of the 2.4–4.4 pp margins to the architectural changes rather than to additional optimization steps.

minor comments (1)

[Abstract] Abstract: the reported top-1 accuracies are given as point estimates without error bars, standard deviations, or the number of independent runs, making it difficult to judge whether the observed margins over baselines are statistically reliable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback and the opportunity to strengthen the clarity of our comparisons. We address the major comment on training schedules point by point below.

Referee: [Abstract] Abstract: the central claim that ConvNeXt's LayerNorm and inverted bottlenecks enable superior width-adaptive performance rests on direct comparisons to HydraViT, MatFormer-S, and SortedNet-S. The manuscript states that Slimmable ConvNeXt-T was trained from scratch for 600 epochs, yet does not specify the training schedules used for the cited baselines (standard ConvNeXt and many slimmable/ViT baselines use 300 epochs). This mismatch is load-bearing for the causal attribution of the 2.4–4.4 pp margins to the architectural changes rather than to additional optimization steps.

Authors: We agree that the training schedules and epoch counts for the baselines must be explicitly documented to support fair attribution of gains. In the revised manuscript we will add a new table (or expanded paragraph in Section 4) that reports the exact training epochs and schedules used in the original HydraViT, MatFormer, and SortedNet papers, as well as for standard ConvNeXt. Our Slimmable ConvNeXt models were trained for 600 epochs because the joint optimization of multiple nested widths typically requires additional steps for convergence; we will qualify the abstract and experimental claims accordingly, noting that part of the observed margin may be attributable to training budget while still highlighting the architectural simplification (no switchable BN, simpler pipeline) as the primary enabler of the width-adaptive behavior.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and independent training

The paper reports empirical top-1 accuracies for Slimmable ConvNeXt variants on ImageNet-1k and compares them directly to external baselines (HydraViT, MatFormer, SortedNet). No equations, derivations, or fitted parameters are presented that reduce the reported accuracies or architectural claims to quantities defined by the paper's own inputs. The central argument—that LayerNorm and inverted bottlenecks enable simpler slimming without switchable BN—is supported by the training procedure and results rather than by self-definition or self-citation chains. Training schedule differences (600 epochs) raise a separate fairness concern but do not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard supervised ImageNet training assumptions and the architectural properties of ConvNeXt as previously published; no new entities are postulated.

free parameters (1)

subnetwork widths
Specific channel widths corresponding to the reported 4.5 GMAC and 1.2 GMAC operating points are chosen by the authors to produce the desired compute-accuracy trade-offs.
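How a width choice maps to compute can be estimated with a back-of-the-envelope count. The sketch below assumes the standard ConvNeXt block layout (a 7×7 depthwise convolution followed by C→4C and 4C→C pointwise layers) and a first-stage width of 96 channels, both taken from the original ConvNeXt design rather than from this page. It ignores LayerNorm, GELU, the stem, downsampling layers, and the classifier head, so it is an approximation and not the paper's GMAC accounting.

```python
import math

def block_macs_per_position(channels, kernel=7, expansion=4):
    """Approximate MACs per spatial position for one ConvNeXt-style block."""
    depthwise = channels * kernel * kernel            # scales linearly with width
    pointwise = 2 * expansion * channels * channels   # scales quadratically with width
    return depthwise + pointwise

def mac_ratio(full_channels, p, kernel=7, expansion=4):
    """Compute of the width-p block relative to the full-width block."""
    kept = math.floor(p * full_channels)
    return (block_macs_per_position(kept, kernel, expansion)
            / block_macs_per_position(full_channels, kernel, expansion))

half_width_ratio = mac_ratio(96, 0.5)  # 96 = assumed first-stage width of ConvNeXt-T
```

Because the pointwise layers dominate, halving the width leaves slightly more than a quarter of the block's compute, which is roughly in line with the reported step from 4.5 to 1.2 GMACs if that step corresponds to half width (an assumption; the page does not state the width fractions).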

axioms (1)

domain assumption: Standard ImageNet-1k supervised training protocol (data augmentation, optimizer, learning-rate schedule) applies without modification.
The 600-epoch training result assumes the usual large-scale vision training setup.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1] Haitham Afifi, Sabrina Pochaba, Andreas Boltres, Dominic Laniewski, Janek Haberer, Leonard Paeleke, Reza Poorzare, Daniel Stolpmann, Nikolas Wehner, Adrian Redder, Eric Samikwa, and Michael Seufert. Machine learning with computer networks: Techniques, datasets, and models. IEEE Access, 12:54673–54720, 2024.

[2] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020.

[3] Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, and Pavlo Molchanov. Flextron: Many-in-one flexible large language model. In International Conference on Machine Learning, 2024.

[4] Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, and Eric Xing. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4931–4941, 2022.

[5] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[7] Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham Kakade, Ali Farhadi, and Prateek Jain. MatFormer: Nested transformer for elastic inference. Advances in Neural Information Processing Systems, 37:140535–140564, 2024.

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.

[9] Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 115–127, 2018.

[10] Janek Haberer, Ali Hojjat, and Olaf Landsiedel. HydraViT: Stacking Heads for a Scalable ViT. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[13] Ali Hojjat, Janek Haberer, Soren Pirk, and Olaf Landsiedel. ThinkingViT: Matryoshka thinking vision transformer for elastic inference. arXiv preprint arXiv:2507.10800, 2025.

[14] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. DynaBERT: Dynamic BERT with adaptive width and depth. Advances in Neural Information Processing Systems, 33:9782–9793, 2020.

[15] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

[16] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations,

[17] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.

[18] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. EfficientFormer: Vision transformers at MobileNet speed. Advances in Neural Information Processing Systems, 35:12934–12949, 2022.

[19] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.

[20] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986,

[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[22] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision, pages 116–131, 2018.

[23] Oren Rippel, Michael Gelbart, and Ryan Adams. Learning ordered representations with nested dropout. In International Conference on Machine Learning, pages 1746–1754. PMLR,

[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[25] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[26] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[27] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.

[28] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.

[29] Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, and Ali Ghodsi. SortedNet, a place for every network and every network in its place: Towards a generalized solution for training many-in-one neural networks. arXiv preprint arXiv:2309.00255,

[30] Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

[31] Jiahui Yu and Thomas Huang. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers, 2019. arXiv:1903.11728 [cs].

[32] Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1803–1811, 2019.

[33] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In International Conference on Learning Representations, 2019.

[34] Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. InceptionNeXt: When inception meets ConvNeXt. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[35] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.

[36] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[37] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020.


work page 2022

[18] [18]

Ef- ficientFormer: Vision transformers at MobileNet speed

Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evan- gelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. Ef- ficientFormer: Vision transformers at MobileNet speed. Advances in Neural Information Processing Systems, 35: 12934–12949, 2022. 7

work page 2022

[19] [19]

Swin Transformer V2: Scaling up capacity and resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling up capacity and resolution. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 12009–12019, 2022. 3

work page 2022

[20] [20]

A ConvNet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11976–11986,

work page

[21] [21]

Decoupled weight de- cay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 4, 5

work page 2019

[22] [22]

ShuffleNet V2: Practical guidelines for efficient CNN archi- tecture design

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN archi- tecture design. InEuropean Conference on Computer Vision, pages 116–131, 2018. 8 9

work page 2018

[23] [23]

Learning ordered representations with nested dropout

Oren Rippel, Michael Gelbart, and Ryan Adams. Learning ordered representations with nested dropout. InInternational Conference on Machine Learning, pages 1746–1754. PMLR,

work page

[24] [24]

Berg, and Li Fei-Fei

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Chal- lenge.International Journal of Computer Vision (IJCV), 115 (3):211–252, 2015. 5

work page 2015

[25] [25]

MobileNetV2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018. 3

work page 2018

[26] [26]

Rethinking the Inception ar- chitecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception ar- chitecture for computer vision. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016. 5

work page 2016

[27] [27]

Mingxing Tan and Quoc V . Le. EfficientNet: Rethinking model scaling for convolutional neural networks. InInterna- tional Conference on Machine Learning, pages 6105–6114. PMLR, 2019. 1, 3, 6

work page 2019

[28] [28]

Training data-efficient image transformers & distillation through at- tention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv ´e J´egou. Training data-efficient image transformers & distillation through at- tention. InInternational conference on machine learning, pages 10347–10357. PMLR, 2021. 2

work page 2021

[29] [29]

SortedNet, a place for every network and every network in its place: Towards a generalized solution for training many- in-one neural networks.arXiv preprint arXiv:2309.00255,

Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Ra- jabzadeh, Marzieh Tahaei, Boxing Chen, and Ali Ghodsi. SortedNet, a place for every network and every network in its place: Towards a generalized solution for training many- in-one neural networks.arXiv preprint arXiv:2309.00255,

work page arXiv

[30] [30]

PyTorch image models.https : / / github

Ross Wightman. PyTorch image models.https : / / github . com / rwightman / pytorch - image - models, 2019. 5

work page 2019

[31] [31]

AutoSlim: Towards One-Shot Architecture Search for Channel Numbers

Jiahui Yu and Thomas Huang. AutoSlim: Towards One- Shot Architecture Search for Channel Numbers, 2019. arXiv:1903.11728 [cs]. 2, 4, 5, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2019

[32] [32]

Universally slimmable net- works and improved training techniques

Jiahui Yu and Thomas S Huang. Universally slimmable net- works and improved training techniques. InProceedings of the IEEE/CVF international conference on computer vision, pages 1803–1811, 2019. 2, 3, 5, 7, 8

work page 2019

[33] [33]

Slimmable neural networks

Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. InInterna- tional Conference on Learning Representations, 2019. 2, 3, 4, 8

work page 2019

[34] [34]

InceptionNeXt: When inception meets ConvNeXt

Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. InceptionNeXt: When inception meets ConvNeXt. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 8

work page 2024

[35] [35]

Cutmix: Regular- ization strategy to train strong classifiers with localizable fea- tures

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regular- ization strategy to train strong classifiers with localizable fea- tures. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 6023–6032, 2019. 4, 5

work page 2019

[36] [36]

Dauphin, and David Lopez-Paz

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimiza- tion. InInternational Conference on Learning Representa- tions, 2018. 4, 5

work page 2018

[37] [37]

Random erasing data augmentation

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020. 4, 5 10

work page 2020