Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment
Pith reviewed 2026-05-22 05:56 UTC · model grok-4.3
The pith
ConvNeXt's LayerNorm and inverted bottlenecks let one set of weights train multiple nested widths for accurate, compute-adaptive inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Slimmable ConvNeXt trains one ConvNeXt model whose weights contain multiple nested subnetworks by jointly optimizing for different channel widths. LayerNorm and inverted bottlenecks remove the need for switchable batch-norm layers, the normalization overhead of prior slimmable CNNs. The result is a simpler training pipeline and higher accuracy than both earlier CNN and recent Vision Transformer approaches at comparable GMACs.
What carries the argument
Nested channel-width subnetworks inside ConvNeXt blocks, where each width is a strict subset of the next and all widths share the same weights while using LayerNorm in place of batch normalization.
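The nesting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes a narrower subnetwork uses the leading channels of each shared weight matrix, and it omits LayerNorm's learnable scale and shift. The names `layer_norm`, `sliced_linear`, and `forward`, and all numeric values, are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one sample's channel vector using its own mean and variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sliced_linear(weight, x, out_ch):
    """Apply only the leading out_ch rows and len(x) columns of a shared matrix."""
    in_ch = len(x)
    return [sum(weight[o][i] * x[i] for i in range(in_ch)) for o in range(out_ch)]

# Shared 4x4 weight matrix (illustrative values, not taken from the paper).
W = [[0.5, -0.2, 0.1, 0.3],
     [0.0, 0.4, -0.1, 0.2],
     [0.3, 0.1, 0.2, -0.4],
     [-0.1, 0.2, 0.5, 0.1]]

x_full = [1.0, 2.0, 3.0, 4.0]

def forward(width):
    x = x_full[:width]               # narrower nets see only the leading channels
    h = sliced_linear(W, x, width)   # same weights, top-left width x width block
    return layer_norm(h)             # statistics come from the sample itself

y_half = forward(2)   # narrow subnetwork
y_full = forward(4)   # full-width network
```

Because `layer_norm` computes its statistics from the current sample, nothing has to be stored per width. Batch normalization keeps running statistics that differ between widths, which is why earlier slimmable CNNs kept a separate batch-norm layer for each width.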
If this is right
- A single set of weights can be deployed across devices with different compute budgets or can adapt width on a single device as resources fluctuate.
- No separate models need to be stored or switched at inference time.
- The same training recipe scales from ConvNeXt-T to ConvNeXt-B, raising peak accuracy to 82.8 percent at 15.35 GMACs.
- Inference latency can be tuned by selecting the appropriate width at runtime without retraining.
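As a rough illustration of runtime width selection, the sketch below picks the most accurate operating point that fits a compute budget. It uses only the two Slimmable ConvNeXt-T operating points quoted in the abstract; the third subnetwork's figures are not given in this text, so it is left out. The function name `pick_operating_point` and the budget-in-GMACs interface are assumptions made for the example.

```python
# (GMACs, top-1 %) for the two Slimmable ConvNeXt-T points quoted in the abstract.
# The third subnetwork is omitted because its figures are not stated in this text.
OPERATING_POINTS = [
    (1.2, 77.4),
    (4.5, 80.8),
]

def pick_operating_point(budget_gmacs, points=OPERATING_POINTS):
    """Return the most accurate (gmacs, top1) pair within budget, or None."""
    feasible = [p for p in points if p[0] <= budget_gmacs]
    if not feasible:
        return None  # no subnetwork fits; the caller must decide how to degrade
    return max(feasible, key=lambda p: p[1])

choice = pick_operating_point(2.0)
```

With these two points, a budget of 2.0 GMACs selects the 1.2 GMACs subnetwork, and any budget below 1.2 GMACs returns `None`. A real deployment would more likely key the choice on measured latency or energy than on GMACs.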
Where Pith is reading between the lines
- The same LayerNorm-enabled slimming technique could be applied to other modern CNN families that have already replaced batch normalization.
- Real-device measurements of latency and energy under thermal throttling would test whether the GMAC figures translate to practical gains.
- Combining width slimming with post-training quantization or pruning could produce even finer-grained efficiency ladders.
Load-bearing premise
ConvNeXt's LayerNorm and inverted-bottleneck blocks eliminate the normalization overhead that previously required switchable batch-norm layers, allowing joint training of multiple widths without accuracy loss.
What would settle it
Train a fixed-width ConvNeXt model at each target width separately and compare each one's top-1 accuracy on ImageNet-1k with that of the corresponding subnetwork extracted from the jointly trained slimmable model; a consistent gap larger than a few percentage points would falsify the claim that the simpler pipeline preserves performance.
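The test above could be scored with something like the following sketch. It reads "consistent gap" as a gap that exceeds the tolerance at every width, and it sets the tolerance to 2.0 percentage points; both are assumptions, since the text only says "a few percentage points". The function names are hypothetical and the demo numbers are placeholders, not measured results.

```python
def accuracy_gaps(standalone, slimmable):
    """Per-width top-1 drop (percentage points) from standalone to slimmable."""
    shared = sorted(standalone.keys() & slimmable.keys())
    return {w: standalone[w] - slimmable[w] for w in shared}

def consistently_worse(standalone, slimmable, tolerance_pp=2.0):
    """True if the slimmable model trails by more than tolerance_pp at every width."""
    gaps = accuracy_gaps(standalone, slimmable)
    return bool(gaps) and all(g > tolerance_pp for g in gaps.values())

# Placeholder numbers purely to exercise the functions; NOT measured results.
demo_standalone = {"narrow": 70.0, "wide": 75.0}
demo_mixed = {"narrow": 69.5, "wide": 71.0}   # large gap at one width only
demo_bad = {"narrow": 66.0, "wide": 71.0}     # large gap at every width

mixed_verdict = consistently_worse(demo_standalone, demo_mixed)
bad_verdict = consistently_worse(demo_standalone, demo_bad)
```

Under this reading, a drop at a single width does not count as falsification; only a drop at every width does. A stricter variant could use the largest gap instead.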
Original abstract
Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
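The HydraViT margins quoted in the abstract can be re-derived from the stated figures. The short check below does only that arithmetic; the pairing of each Slimmable ConvNeXt-T point with a HydraViT configuration follows the abstract, while the dictionary layout and the function name `margins` are choices made for this example.

```python
# Figures as quoted in the abstract: ((our GMACs, our top-1 %), (baseline GMACs, baseline top-1 %)).
PAIRS = {
    "HydraViT 6-head": ((4.5, 80.8), (4.6, 78.4)),
    "HydraViT 3-head": ((1.2, 77.4), (1.3, 73.0)),
}

def margins(pairs):
    """Top-1 margin in percentage points and GMACs ratio (ours / baseline) per pair."""
    out = {}
    for name, ((our_gmacs, our_acc), (base_gmacs, base_acc)) in pairs.items():
        out[name] = (round(our_acc - base_acc, 1), round(our_gmacs / base_gmacs, 3))
    return out

result = margins(PAIRS)
```

The margins come out to 2.4 and 4.4 percentage points, matching the abstract, and in both pairs the Slimmable ConvNeXt-T subnetwork uses slightly fewer GMACs than the HydraViT configuration it is compared against.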
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Slimmable ConvNeXt, a width-adaptive CNN that exploits ConvNeXt's LayerNorm and inverted-bottleneck blocks to train a single set of weights containing multiple nested subnetworks without requiring switchable batch-normalization layers. On ImageNet-1k, the Tiny variant with three subnetworks is reported to reach 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs after 600 epochs of training from scratch, exceeding HydraViT (6-head: 78.4% at 4.6 GMACs; 3-head: 73.0% at 1.3 GMACs), MatFormer-S (78.6%), and SortedNet-S (78.2%) at comparable compute; the Base variant further reaches 82.8% at 15.35 GMACs.
Significance. If the reported gains can be shown to stem from the architectural simplifications rather than training-budget differences, the work would provide a practical route to efficient multi-device and dynamic-compute deployment. By demonstrating that a modern CNN backbone supports nested width-adaptive subnetworks with a simpler pipeline than either classical slimmable CNNs or recent ViT-based methods, it reduces engineering overhead for adaptive inference while delivering concrete accuracy-versus-GMACs improvements on a public benchmark.
major comments (1)
- [Abstract] The central claim that ConvNeXt's LayerNorm and inverted bottlenecks enable superior width-adaptive performance rests on direct comparisons to HydraViT, MatFormer-S, and SortedNet-S. The manuscript states that Slimmable ConvNeXt-T was trained from scratch for 600 epochs, yet does not specify the training schedules used for the cited baselines (standard ConvNeXt and many slimmable/ViT baselines use 300 epochs). This mismatch is load-bearing for the causal attribution of the 2.4–4.4 pp margins to the architectural changes rather than to additional optimization steps.
minor comments (1)
- [Abstract] The reported top-1 accuracies are given as point estimates without error bars, standard deviations, or the number of independent runs, making it difficult to judge whether the observed margins over baselines are statistically reliable.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback and the opportunity to strengthen the clarity of our comparisons. We address the major comment on training schedules point by point below.
- Referee: [Abstract] The central claim that ConvNeXt's LayerNorm and inverted bottlenecks enable superior width-adaptive performance rests on direct comparisons to HydraViT, MatFormer-S, and SortedNet-S. The manuscript states that Slimmable ConvNeXt-T was trained from scratch for 600 epochs, yet does not specify the training schedules used for the cited baselines (standard ConvNeXt and many slimmable/ViT baselines use 300 epochs). This mismatch is load-bearing for the causal attribution of the 2.4–4.4 pp margins to the architectural changes rather than to additional optimization steps.
Authors: We agree that the training schedules and epoch counts for the baselines must be explicitly documented to support fair attribution of gains. In the revised manuscript we will add a new table (or expanded paragraph in Section 4) that reports the exact training epochs and schedules used in the original HydraViT, MatFormer, and SortedNet papers, as well as for standard ConvNeXt. Our Slimmable ConvNeXt models were trained for 600 epochs because the joint optimization of multiple nested widths typically requires additional steps for convergence; we will qualify the abstract and experimental claims accordingly, noting that part of the observed margin may be attributable to training budget while still highlighting the architectural simplification (no switchable BN, simpler pipeline) as the primary enabler of the width-adaptive behavior. revision: yes
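To make "joint optimization of multiple nested widths" concrete, here is a toy sketch in plain Python. It assumes the per-width losses are simply summed and that each width uses the leading entries of one shared weight vector; the text does not say how the paper combines the losses, so this is one plausible formulation, not the authors' recipe. The linear model, the data, and the names `predict` and `joint_step` are all hypothetical.

```python
def predict(w, x, k):
    """Linear prediction using only the first k shared weights and inputs."""
    return sum(w[i] * x[i] for i in range(k))

def joint_step(w, x, y, widths, lr=0.01):
    """One gradient step on the summed squared error over all widths."""
    grad = [0.0] * len(w)
    loss = 0.0
    for k in widths:
        err = predict(w, x, k) - y
        loss += err * err
        for i in range(k):           # leading weights receive gradient from every width
            grad[i] += 2.0 * err * x[i]
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return new_w, loss

w = [0.0, 0.0, 0.0, 0.0]             # shared weights for all widths
x, y = [1.0, 2.0, 3.0, 4.0], 10.0    # single toy training example
losses = []
for _ in range(50):
    w, loss = joint_step(w, x, y, widths=(2, 4))
    losses.append(loss)
```

The leading weights are updated by every width's error while the trailing weights are updated only by the wider one, so the shared parameters must satisfy several objectives at once. This toy does not show whether that requires extra training epochs; that remains the authors' claim.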
Circularity Check
No circularity: empirical results rest on external benchmarks and independent training
The paper reports empirical top-1 accuracies for Slimmable ConvNeXt variants on ImageNet-1k and compares them directly to external baselines (HydraViT, MatFormer, SortedNet). No equations, derivations, or fitted parameters are presented that reduce the reported accuracies or architectural claims to quantities defined by the paper's own inputs. The central argument—that LayerNorm and inverted bottlenecks enable simpler slimming without switchable BN—is supported by the training procedure and results rather than by self-definition or self-citation chains. Training schedule differences (600 epochs) raise a separate fairness concern but do not constitute circularity under the defined patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- subnetwork widths
axioms (1)
- domain assumption: Standard ImageNet-1k supervised training protocol (data augmentation, optimizer, learning-rate schedule) applies without modification.