arxiv: 2605.22677 · v1 · pith:E52PLSPZnew · submitted 2026-05-21 · 💻 cs.CV

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Pith reviewed 2026-05-22 05:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords slimmable networks · ConvNeXt · width-adaptive inference · efficient deployment · ImageNet · nested subnetworks · LayerNorm

The pith

ConvNeXt's LayerNorm and inverted bottlenecks let one set of weights train multiple nested widths for accurate, compute-adaptive inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that modern ConvNeXt blocks support width-slimming without the switchable batch-norm layers required by earlier CNNs. Because LayerNorm and inverted bottlenecks remove normalization overhead, a single shared-weight network can be trained from scratch to contain multiple nested subnetworks of increasing capacity. On ImageNet-1k this yields 80.8 percent top-1 accuracy at 4.5 GMACs and 77.4 percent at 1.2 GMACs with only three subnetworks, exceeding HydraViT, MatFormer, and SortedNet at matching compute. The same weights therefore serve devices whose available compute varies because of battery, heat, or latency limits.

Core claim

Slimmable ConvNeXt trains one ConvNeXt model whose weights contain multiple nested subnetworks by jointly optimizing for different channel widths. LayerNorm and inverted bottlenecks remove the normalization overhead that forced switchable batch-norm layers in prior slimmable CNNs. The paper reports that this yields a simpler training pipeline and higher accuracy than both earlier CNN and recent Vision Transformer approaches at comparable GMACs.
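As a rough illustration of what jointly optimizing for different channel widths can look like, the sketch below sums a squared-error loss over several nested widths of one shared weight vector. Summing per-width losses is an assumption borrowed from classical slimmable training, not a detail confirmed by this page, and the toy linear model and the names `predict`, `joint_gradient`, and `sgd_step` are hypothetical.

```python
def predict(weights, features, width):
    """Prediction of the subnetwork that uses only the first `width` weights."""
    return sum(w * x for w, x in zip(weights[:width], features[:width]))


def joint_gradient(weights, features, target, widths):
    """Gradient of the squared error summed over all nested widths.

    Assumption: the joint objective is a plain sum of per-width losses.
    """
    grad = [0.0] * len(weights)
    for width in widths:
        error = predict(weights, features, width) - target
        for i in range(width):
            # Only the first `width` weights take part in this subnetwork.
            grad[i] += 2.0 * error * features[i]
    return grad


def sgd_step(weights, features, target, widths, lr=0.1):
    """One gradient-descent step on the shared weights."""
    grad = joint_gradient(weights, features, target, widths)
    return [w - lr * g for w, g in zip(weights, grad)]
```

In this toy setting the leading weights receive a gradient contribution from every width, while the trailing weights are updated only by the wider subnetworks.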

What carries the argument

Nested channel-width subnetworks inside ConvNeXt blocks, where each width is a strict subset of the next and all widths share the same weights while using LayerNorm in place of batch normalization.
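The nesting can be sketched with the floor rule quoted in the Figure 2 caption (keep the first ⌊p·C⌋ channels). The stage widths below are the ConvNeXt-T channel counts from the original ConvNeXt paper; the width ratios 0.25, 0.5, and 1.0 are an assumption for illustration, since this page does not list the exact ratios used.

```python
import math

# ConvNeXt-T stage widths (from the original ConvNeXt architecture).
STAGE_CHANNELS = (96, 192, 384, 768)
# Assumed width ratios; not stated on this page.
WIDTH_RATIOS = (0.25, 0.5, 1.0)


def nested_channel_counts(full_channels, width_ratios):
    """Channels kept at each width: the first floor(p * C) channels."""
    return [math.floor(p * full_channels) for p in sorted(width_ratios)]


def is_strictly_nested(counts):
    """Prefixes [0, k) are strict subsets of each other iff k strictly increases."""
    return all(a < b for a, b in zip(counts, counts[1:]))


channel_table = {c: nested_channel_counts(c, WIDTH_RATIOS) for c in STAGE_CHANNELS}
```

Because every width keeps a prefix of the channel axis, a narrower subnetwork never owns weights that a wider one lacks.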

If this is right

  • A single set of weights can be deployed across devices with different compute budgets or can adapt width on a single device as resources fluctuate.
  • No separate models need to be stored or switched at inference time.
  • The same training recipe scales from ConvNeXt-T to ConvNeXt-B, raising peak accuracy to 82.8 percent at 15.35 GMACs.
  • Inference latency can be tuned by selecting the appropriate width at runtime without retraining.
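A minimal sketch of runtime width selection, assuming the device exposes a compute budget in GMACs. The two operating points are the ConvNeXt-T figures quoted on this page; `select_operating_point` is a hypothetical helper, not part of the paper's code.

```python
# (GMACs, top-1 %) for Slimmable ConvNeXt-T with 3 subnetworks, as quoted above.
OPERATING_POINTS = [(1.2, 77.4), (4.5, 80.8)]


def select_operating_point(budget_gmacs, points=OPERATING_POINTS):
    """Return the most expensive operating point that fits the budget, or None."""
    feasible = [point for point in points if point[0] <= budget_gmacs]
    if not feasible:
        return None
    return max(feasible, key=lambda point: point[0])
```

The same weights serve every returned point; only the active channel prefix changes.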

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LayerNorm-enabled slimming technique could be applied to other modern CNN families that have already replaced batch normalization.
  • Real-device measurements of latency and energy under thermal throttling would test whether the GMAC figures translate to practical gains.
  • Combining width slimming with post-training quantization or pruning could produce even finer-grained efficiency ladders.

Load-bearing premise

ConvNeXt's LayerNorm and inverted-bottleneck blocks eliminate the normalization overhead that previously required switchable batch-norm layers, allowing joint training of multiple widths without accuracy loss.

What would settle it

Train a fixed-width ConvNeXt model at each target width separately and compare its top-1 accuracy on ImageNet-1k to the corresponding subnetwork extracted from the jointly trained slimmable model; a consistent gap larger than a few percentage points would falsify the claim that the simpler pipeline preserves performance.
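The comparison could be scored along these lines. The 3.0-point default is an assumed stand-in for "a few percentage points", reading "consistent" as "at every width" is also an assumption, and the function names are hypothetical.

```python
def width_gaps(standalone_top1, slimmable_top1):
    """Per-width gap (standalone minus slimmable) in percentage points."""
    if standalone_top1.keys() != slimmable_top1.keys():
        raise ValueError("both results must cover the same widths")
    return {w: standalone_top1[w] - slimmable_top1[w] for w in standalone_top1}


def claim_is_falsified(gaps, threshold_pp=3.0):
    """True if every width shows a gap above the (assumed) threshold."""
    return bool(gaps) and all(gap > threshold_pp for gap in gaps.values())
```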

Figures

Figures reproduced from arXiv: 2605.22677 by Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel.

Figure 1: ConvNeXt-T variants vs. ViT-S-based baselines on ImageNet-1k. Slimmable ConvNeXt-T with 3, 4, and 5 subnetworks compared to HydraViT, SortedNet, DynaBERT, and MatFormer in the low-compute regime. All ConvNeXt-T variants achieve a higher accuracy at significantly lower GMACs.
Figure 2: A ConvNeXt block slimmed to p=0.5. All weight tensors are sliced to retain the first ⌊p·C⌋ channels, including the 4× expanded intermediate layer. The residual is zero-padded to match the input dimension.
2. Related Work. Training a single set of shared weights containing multiple nested subnetworks of increasing capacity has a long history in CNNs. Slimmable Networks [33] train one model at multiple pre…
Figure 3: The ConvNeXt-Tiny architecture with 4 stages and 18 blocks total. Downsampling layers between stages halve spatial resolution
Figure 4: Accuracy vs. GMACs for all Slimmable ConvNeXt variants and ViT-based baselines on ImageNet-1k. Slimmable ConvNeXt models across all three sizes achieve competitive or superior accuracy at lower GMACs compared to the ViT-based methods.
…izes the accuracy-compute trade-off across all model sizes and baselines. Our strongest configuration, Slimmable ConvNeXt-B with 4 subnetworks, achieves 82.8% top-1 accuracy …
Figure 5: Scaling comparison of Slimmable ConvNeXt across three ConvNeXt sizes. Each panel shows the accuracy-compute trade-off for a single model size with its subnetwork variants, compared to ViT-based baselines.
…retain 82.5%, matching the 3-subnetwork baseline. This robustness is consistent with the observation above that wider channel dimensions provide sufficient redundancy to absorb the cost of additional sub…
Figure 6 compares Slimmable ConvNeXt-T against classical CNN slimming methods: AutoSlim-ResNet-50 [31] and US-MobileNet v1/v2 [32]. The MobileNet-based methods are limited to a narrow GMACs range below 0.6 GMACs due to their lightweight base architectures, and AutoSlim-ResNet-50 reaches at most 76.0% at 3.0 GMACs. In contrast, Slimmable ConvNeXt-T covers 0.3 to 4.5 GMACs and reaches 80.8% at full capacity, dem…
Figure 7: AutoSlim vs. uniform slimming on ConvNeXt-T with 3 and 4 subnetworks. AutoSlim provides modest improvements at low capacity for both subnetwork counts, while uniform slimming retains a slight edge at full capacity.
[Plot: Accuracy [%] vs. GMACs; series: Slim-ConvNeXt-T (600 epochs), Slim-ConvNeXt-T (300 epochs), AutoSlim-ConvNeXt-T (600 epochs), AutoSlim-ConvNeXt-T (300 epochs)]
Figure 8: Training duration and AutoSlim on ConvNeXt-T (3 subnetworks). Standard and AutoSlim models at 300 and 600 epochs. Longer training benefits all operating points, especially at low capacity.
…ConvNeXt-T with 3 subnetworks. Extending training from 300 to 600 epochs improves accuracy for all subnetworks, with the largest gains at low capacity (65.4% to 66.9% at quarter capacity vs. 79.8% to 80.8% at full capaci…
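The slicing described in the Figure 2 caption can be sketched in plain Python for a single spatial position. This is an illustrative reading, not the authors' implementation: it omits the depthwise convolution, biases, LayerNorm affine parameters, and layer scale, and it assumes that "zero-padded" means the slimmed branch output is padded back to C channels before being added to the input. All function names are hypothetical.

```python
import math


def layer_norm(values, eps=1e-6):
    """Normalize over the active channels only; no per-width running statistics."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weight, vec):
    """Pointwise (1x1) layer: weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def slice_weight(weight, out_keep, in_keep):
    """Keep the first out_keep rows and the first in_keep columns."""
    return [row[:in_keep] for row in weight[:out_keep]]


def slimmed_block(x, w_expand, w_project, p):
    """w_expand has shape (4C, C), w_project has shape (C, 4C)."""
    channels = len(x)
    keep = math.floor(p * channels)
    if keep == 0:
        return list(x)  # nothing active: the block reduces to the identity
    hidden_keep = math.floor(p * len(w_expand))  # same rule on the 4x layer
    h = layer_norm(x[:keep])
    h = [gelu(v) for v in linear(slice_weight(w_expand, hidden_keep, keep), h)]
    h = linear(slice_weight(w_project, keep, hidden_keep), h)
    padded = h + [0.0] * (channels - keep)  # assumed zero-padding of the branch
    return [xi + hi for xi, hi in zip(x, padded)]
```

Under this reading, channels beyond the first ⌊p·C⌋ pass through the block unchanged.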
Original abstract

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
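The quoted margins over HydraViT can be rechecked from the figures in the abstract. The labels "full" and "reduced" are hypothetical names for the two reported operating points.

```python
# (GMACs, top-1 %) as quoted in the abstract.
SLIMMABLE_CONVNEXT_T = {"full": (4.5, 80.8), "reduced": (1.2, 77.4)}
HYDRAVIT = {"6-head": (4.6, 78.4), "3-head": (1.3, 73.0)}
# Pairs compared at similar compute in the abstract.
PAIRS = [("full", "6-head"), ("reduced", "3-head")]


def margin_pp(ours_top1, baseline_top1):
    """Accuracy margin in percentage points, rounded to one decimal."""
    return round(ours_top1 - baseline_top1, 1)


margins = {
    baseline: margin_pp(SLIMMABLE_CONVNEXT_T[ours][1], HYDRAVIT[baseline][1])
    for ours, baseline in PAIRS
}
```

Both differences match the 2.4 and 4.4 percentage points stated in the abstract, and in each pair the ConvNeXt point uses 0.1 GMACs less than the HydraViT point.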

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Slimmable ConvNeXt, a width-adaptive CNN that exploits ConvNeXt's LayerNorm and inverted-bottleneck blocks to train a single set of weights containing multiple nested subnetworks without requiring switchable batch-normalization layers. On ImageNet-1k, the Tiny variant with three subnetworks is reported to reach 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs after 600 epochs of training from scratch, exceeding HydraViT (6-head: 78.4% at 4.6 GMACs; 3-head: 73.0% at 1.3 GMACs), MatFormer-S (78.6%), and SortedNet-S (78.2%) at comparable compute; the Base variant further reaches 82.8% at 15.35 GMACs.

Significance. If the reported gains can be shown to stem from the architectural simplifications rather than training-budget differences, the work would provide a practical route to efficient multi-device and dynamic-compute deployment. By demonstrating that a modern CNN backbone supports nested width-adaptive subnetworks with a simpler pipeline than either classical slimmable CNNs or recent ViT-based methods, it reduces engineering overhead for adaptive inference while delivering concrete accuracy-versus-GMACs improvements on a public benchmark.

major comments (1)
  1. [Abstract] The central claim that ConvNeXt's LayerNorm and inverted bottlenecks enable superior width-adaptive performance rests on direct comparisons to HydraViT, MatFormer-S, and SortedNet-S. The manuscript states that Slimmable ConvNeXt-T was trained from scratch for 600 epochs, yet does not specify the training schedules used for the cited baselines (standard ConvNeXt and many slimmable/ViT baselines use 300 epochs). This mismatch is load-bearing for the causal attribution of the 2.4–4.4 pp margins to the architectural changes rather than to additional optimization steps.
minor comments (1)
  1. [Abstract] The reported top-1 accuracies are given as point estimates without error bars, standard deviations, or the number of independent runs, making it difficult to judge whether the observed margins over baselines are statistically reliable.
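One way the requested statistics could be reported, sketched under the assumption that several independent runs per configuration are available. The two-standard-deviation rule is a crude heuristic chosen for illustration, not a formal significance test, and both function names are hypothetical.

```python
import math
import statistics


def summarize_runs(top1_per_run):
    """Mean and sample standard deviation of top-1 accuracy over runs."""
    if len(top1_per_run) < 2:
        raise ValueError("need at least two independent runs")
    return statistics.mean(top1_per_run), statistics.stdev(top1_per_run)


def margin_exceeds_noise(ours_runs, baseline_runs, num_std=2.0):
    """True if the mean gap is larger than num_std combined standard deviations."""
    mean_ours, std_ours = summarize_runs(ours_runs)
    mean_base, std_base = summarize_runs(baseline_runs)
    combined = math.sqrt(std_ours ** 2 + std_base ** 2)
    return (mean_ours - mean_base) > num_std * combined
```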

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback and the opportunity to strengthen the clarity of our comparisons. We address the major comment on training schedules point by point below.

  1. Referee: [Abstract] The central claim that ConvNeXt's LayerNorm and inverted bottlenecks enable superior width-adaptive performance rests on direct comparisons to HydraViT, MatFormer-S, and SortedNet-S. The manuscript states that Slimmable ConvNeXt-T was trained from scratch for 600 epochs, yet does not specify the training schedules used for the cited baselines (standard ConvNeXt and many slimmable/ViT baselines use 300 epochs). This mismatch is load-bearing for the causal attribution of the 2.4–4.4 pp margins to the architectural changes rather than to additional optimization steps.

    Authors: We agree that the training schedules and epoch counts for the baselines must be explicitly documented to support fair attribution of gains. In the revised manuscript we will add a new table (or expanded paragraph in Section 4) that reports the exact training epochs and schedules used in the original HydraViT, MatFormer, and SortedNet papers, as well as for standard ConvNeXt. Our Slimmable ConvNeXt models were trained for 600 epochs because the joint optimization of multiple nested widths typically requires additional steps for convergence; we will qualify the abstract and experimental claims accordingly, noting that part of the observed margin may be attributable to training budget while still highlighting the architectural simplification (no switchable BN, simpler pipeline) as the primary enabler of the width-adaptive behavior.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and independent training

The paper reports empirical top-1 accuracies for Slimmable ConvNeXt variants on ImageNet-1k and compares them directly to external baselines (HydraViT, MatFormer, SortedNet). No equations, derivations, or fitted parameters are presented that reduce the reported accuracies or architectural claims to quantities defined by the paper's own inputs. The central argument—that LayerNorm and inverted bottlenecks enable simpler slimming without switchable BN—is supported by the training procedure and results rather than by self-definition or self-citation chains. Training schedule differences (600 epochs) raise a separate fairness concern but do not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard supervised ImageNet training assumptions and the architectural properties of ConvNeXt as previously published; no new entities are postulated.

free parameters (1)
  • subnetwork widths
    Specific channel widths corresponding to the reported 4.5 GMAC and 1.2 GMAC operating points are chosen by the authors to produce the desired compute-accuracy trade-offs.
axioms (1)
  • domain assumption: Standard ImageNet-1k supervised training protocol (data augmentation, optimizer, learning-rate schedule) applies without modification.
    The 600-epoch training result assumes the usual large-scale vision training setup.
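As a sanity check on the "subnetwork widths" free parameter, the sketch below estimates how the MACs of the two pointwise layers in an inverted bottleneck scale when input and output channels are both cut to ⌊p·C⌋. That the 1.2 GMAC operating point corresponds to p = 0.5 is an assumption made here for illustration; this page does not state the ratio.

```python
import math


def pointwise_conv_macs(height, width, c_in, c_out):
    """Multiply-accumulates of a 1x1 convolution over a height x width map."""
    return height * width * c_in * c_out


def slimmed_mac_ratio(channels, p, expansion=4, height=1, width=1):
    """MAC ratio (slim / full) for the expand and project layers of one block."""
    keep = math.floor(p * channels)
    full = 2 * pointwise_conv_macs(height, width, channels, expansion * channels)
    slim = 2 * pointwise_conv_macs(height, width, keep, expansion * keep)
    return slim / full


# Ratio of the two reported ConvNeXt-T operating points (1.2 vs. 4.5 GMACs).
REPORTED_RATIO = 1.2 / 4.5
```

Under the p = 0.5 assumption the pointwise layers alone predict a ratio of 0.25, close to the reported ratio of roughly 0.27. Layers whose cost does not shrink quadratically (for example depthwise convolutions) might account for the difference, though that is not verified here.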

pith-pipeline@v0.9.0 · 5832 in / 1323 out tokens · 52773 ms · 2026-05-22T05:56:27.359690+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

  1. [1]

    Machine learning with computer networks: Techniques, datasets, and models

    Haitham Afifi, Sabrina Pochaba, Andreas Boltres, Dominic Laniewski, Janek Haberer, Leonard Paeleke, Reza Poorzare, Daniel Stolpmann, Nikolas Wehner, Adrian Redder, Eric Samikwa, and Michael Seufert. Machine learning with computer networks: Techniques, datasets, and models. IEEE Access, 12:54673–54720, 2024.

  2. [2]

    Once for all: Train one network and specialize it for efficient deployment

    Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020.

  3. [3]

    Flextron: Many-in-one flexible large language model

    Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, and Pavlo Molchanov. Flextron: Many-in-one flexible large language model. In International Conference on Machine Learning, 2024.

  4. [4]

    Vision transformer slimming: Multi-dimension searching in continuous optimization space

    Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, and Eric Xing. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4931–4941, 2022.

  5. [5]

    Randaugment: Practical automated data augmentation with a reduced search space

    Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.

  6. [6]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

  7. [7]

    MatFormer: Nested transformer for elastic inference

    Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham Kakade, Ali Farhadi, and Prateek Jain. MatFormer: Nested transformer for elastic inference. Advances in Neural Information Processing Systems, 37:140535–140564, 2024.

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.

  9. [9]

    NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision

    Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 115–127, 2018.

  10. [10]

    HydraViT: Stacking Heads for a Scalable ViT

    Janek Haberer, Ali Hojjat, and Olaf Landsiedel. HydraViT: Stacking Heads for a Scalable ViT. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

  11. [11]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  12. [12]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  13. [13]

    ThinkingViT: Matryoshka thinking vision transformer for elastic inference

    Ali Hojjat, Janek Haberer, Soren Pirk, and Olaf Landsiedel. ThinkingViT: Matryoshka thinking vision transformer for elastic inference. arXiv preprint arXiv:2507.10800, 2025.

  14. [14]

    DynaBERT: Dynamic BERT with adaptive width and depth

    Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. DynaBERT: Dynamic BERT with adaptive width and depth. Advances in Neural Information Processing Systems, 33:9782–9793, 2020.

  15. [15]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

  16. [16]

    Multi-scale dense networks for resource efficient image classification

    Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations,

  17. [17]

    Matryoshka representation learning

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.

  18. [18]

    EfficientFormer: Vision transformers at MobileNet speed

    Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. EfficientFormer: Vision transformers at MobileNet speed. Advances in Neural Information Processing Systems, 35:12934–12949, 2022.

  19. [19]

    Swin Transformer V2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.

  20. [20]

    A ConvNet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986,

  21. [21]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

  22. [22]

    ShuffleNet V2: Practical guidelines for efficient CNN architecture design

    Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision, pages 116–131, 2018.

  23. [23]

    Learning ordered representations with nested dropout

    Oren Rippel, Michael Gelbart, and Ryan Adams. Learning ordered representations with nested dropout. In International Conference on Machine Learning, pages 1746–1754. PMLR,

  24. [24]

    ImageNet Large Scale Visual Recognition Challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

  25. [25]

    MobileNetV2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

  26. [26]

    Rethinking the Inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

  27. [27]

    EfficientNet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.

  28. [28]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

  29. [29]

    SortedNet, a place for every network and every network in its place: Towards a generalized solution for training many-in-one neural networks

    Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, and Ali Ghodsi. SortedNet, a place for every network and every network in its place: Towards a generalized solution for training many-in-one neural networks. arXiv preprint arXiv:2309.00255,

  30. [30]

    PyTorch image models

    Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

  31. [31]

    AutoSlim: Towards One-Shot Architecture Search for Channel Numbers

    Jiahui Yu and Thomas Huang. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers, 2019. arXiv:1903.11728 [cs].

  32. [32]

    Universally slimmable networks and improved training techniques

    Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1803–1811, 2019.

  33. [33]

    Slimmable neural networks

    Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In International Conference on Learning Representations, 2019.

  34. [34]

    InceptionNeXt: When inception meets ConvNeXt

    Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. InceptionNeXt: When inception meets ConvNeXt. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  35. [35]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.

  36. [36]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

  37. [37]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020.