Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

Adrien Deli\`ege; Marc Van Droogenbroeck; Maxim Henry; S\'ebastien Pi\'erard

arxiv: 2605.23656 · v1 · pith:WWZSO3N6new · submitted 2026-05-22 · 💻 cs.CV

Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

Maxim Henry , Adrien Deli\`ege , S\'ebastien Pi\'erard , Marc Van Droogenbroeck This is my paper

Pith reviewed 2026-05-25 04:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords RBDCblock-diagonal couplingvision transformersResNetImageNettraining efficiencymodel growthDeiT

0 comments

The pith

Recursive block-diagonal coupling of narrower models reduces vision model training FLOPs by 30% at matched accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an RBDC training protocol that builds wide vision models by recursively coupling narrower models trained independently through a parameter-free block-diagonal method. This protocol is tested on DeiT vision transformers and ResNet convolutional networks for ImageNet classification. It reports better efficiency than standard from-scratch training or existing model growth approaches, with concrete gains in FLOPs and downstream task performance.

Core claim

Wide models can be assembled by recursively coupling narrower, independently trained models in a parameter-free block-diagonal way, which supports flexible budget allocation and delivers 30% FLOPs reduction at similar ImageNet test accuracies compared with standard training while also outperforming model growth protocols at equal compute.

What carries the argument

Recursive block-diagonal coupling (RBDC), a parameter-free operation that combines narrower models without joint fine-tuning while aiming to preserve representational capacity.

If this is right

Models trained with RBDC achieve higher performance at the same training FLOPs than protocols from the model growth literature.
The resulting models serve as stronger backbones than their original counterparts for downstream object detection and instance segmentation.
Training budget can be allocated flexibly across the narrower models involved in the recursive construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may allow reuse of existing smaller model checkpoints to construct larger ones under fixed total compute.
If the coupling preserves capacity across domains, similar savings could appear in non-vision tasks where narrow models are already available.
The independence of the initial trainings opens the possibility of parallel or distributed development of the component models before coupling.

Load-bearing premise

Independently trained narrower models can be combined via parameter-free block-diagonal coupling while preserving representational capacity without requiring joint fine-tuning to recover performance.

What would settle it

A direct measurement showing that the block-diagonally coupled model accuracy on ImageNet falls substantially below the level predicted from the narrower models' individual accuracies, with no recovery possible without additional joint training.

Figures

Figures reproduced from arXiv: 2605.23656 by Adrien Deli\`ege, Marc Van Droogenbroeck, Maxim Henry, S\'ebastien Pi\'erard.

**Figure 1.** Figure 1: Overview. We introduce a resource-efficient training protocol for high-capacity computer vision models (e.g., transformers, CNNs) that we call Recursive BlockDiagonal Coupling (RBDC). To train a target model, RBDC recursively trains narrower models and couples them to build a block-diagonal initialization for training wider models (top). The budget of training FLOPs is distributed according to the differ… view at source ↗

**Figure 2.** Figure 2: Resource Efficiency of RBDC . Accuracy vs. FLOPs trade-off curves when training a target DeiT-Base model. The values written in gray correspond to the number of training epochs for each recursive step, from the narrowest to the target model. RBDC with two recursive steps (green) is more efficient than RBDC with one recursive step (orange), which is better than the standard training baseline (blue), yield… view at source ↗

**Figure 3.** Figure 3: Optimal Training Ratio. Experimental results corresponding to the training on ImageNet-1K of DeiT-Small (left) and DeiT-Base (right) models with one recursive step of RBDC. Top: the curves indicate an optimal ratio around the range between 1.5 and 2.0 across both model scales. Bottom: the repartition in training epochs of the narrow and wide phases for each point creating lines where the slope corresponds … view at source ↗

read the original abstract

Training high-capacity vision models from scratch requires substantial computational resources. To improve training efficiency of a wide target model, existing growth methods often assume the availability of narrower models, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by coupling in a parameter-free block-diagonal way narrower, independently trained models in a recursive way. This allows a flexible allocation of the training budget available across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol shows a much better efficiency than models trained from scratch with the standard protocol, yielding 30% FLOPs reduction at similar test accuracies. It also achieves higher performances at same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RBDC introduces recursive block-diagonal coupling of independently trained narrow models to cut training FLOPs, but the abstract leaves the key capacity-preservation question open.

read the letter

The core idea is to train several narrow models separately, then recursively assemble them into a wider target model using parameter-free block-diagonal coupling. This is presented as distinct from standard model growth techniques because the sub-models stay independent until the coupling step and the training budget can be split flexibly across them. On ImageNet the method reportedly delivers 30% lower FLOPs at matched accuracy for both DeiT and ResNet, beats growth baselines at equal compute, and produces stronger backbones for detection and segmentation.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Recursive Block-Diagonal Coupling (RBDC), a training protocol that constructs wide vision models (DeiT, ResNet) on ImageNet by recursively coupling narrower models that were trained independently, using parameter-free block-diagonal connections. This enables flexible allocation of the total training budget across sub-models. The central empirical claims are a 30% reduction in training FLOPs at matched test accuracy versus training the target wide model from scratch, superior accuracy at equal FLOPs versus prior model-growth protocols, and improved performance when the resulting models are used as backbones for object detection and instance segmentation.

Significance. If the efficiency claims hold under rigorous controls, the method would provide a practical route to amortize training cost across independently trained narrower models while avoiding joint optimization overhead. The recursive construction and applicability to both CNNs and transformers are positive features; the downstream-task results, if confirmed, would further strengthen the case for adoption. The absence of any machine-checked proofs or open code is noted but does not detract from the empirical focus.

major comments (3)

[Abstract / Results] Abstract and experimental protocol: the 30% FLOPs-reduction claim at matched accuracy is presented without any description of the number of independent runs, standard deviation, exact baseline implementations (including whether the same data-augmentation and optimizer schedules were used), or rules for excluding runs or images. These omissions make it impossible to determine whether the reported savings are statistically supported or sensitive to implementation details.
[Method (RBDC definition)] Method section on block-diagonal coupling: the construction keeps weight blocks strictly separate, so that convolutions and (in DeiT) attention operate only within blocks at every layer. The efficiency claim requires that this restriction does not materially reduce representational capacity relative to a jointly trained wide model. No ablation or analysis is supplied that quantifies the effect of absent cross-block interactions or demonstrates that the recursive coupling restores any necessary mixing.
[Experiments / Efficiency comparison] Results on training budget: the reported FLOPs savings assume that no post-coupling fine-tuning is performed. If any recovery step is used, its cost must be added to the total budget; the manuscript does not state whether such a step occurs or how its cost is accounted for in the 30% figure.

minor comments (1)

[Method] Notation for the recursive coupling operator is introduced without an explicit equation or pseudocode listing the base case and inductive step; a compact formal definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. Below we provide point-by-point responses to the major comments, indicating where revisions have been made to address the concerns raised.

read point-by-point responses

Referee: [Abstract / Results] Abstract and experimental protocol: the 30% FLOPs-reduction claim at matched accuracy is presented without any description of the number of independent runs, standard deviation, exact baseline implementations (including whether the same data-augmentation and optimizer schedules were used), or rules for excluding runs or images. These omissions make it impossible to determine whether the reported savings are statistically supported or sensitive to implementation details.

Authors: We agree that the original submission omitted key statistical and implementation details. In the revised manuscript we now report results from three independent runs per configuration, include mean accuracy with standard deviation, explicitly state that all baselines (including from-scratch training and prior growth methods) used identical data-augmentation pipelines and optimizer schedules drawn from the original publications, and confirm that no runs or images were excluded on the basis of performance. These additions directly support the statistical reliability of the reported 30% FLOPs reduction. revision: yes
Referee: [Method (RBDC definition)] Method section on block-diagonal coupling: the construction keeps weight blocks strictly separate, so that convolutions and (in DeiT) attention operate only within blocks at every layer. The efficiency claim requires that this restriction does not materially reduce representational capacity relative to a jointly trained wide model. No ablation or analysis is supplied that quantifies the effect of absent cross-block interactions or demonstrates that the recursive coupling restores any necessary mixing.

Authors: The block-diagonal construction is deliberate to permit fully independent training of the narrower sub-models. While intra-layer cross-block interactions are absent after each coupling step, the recursive procedure progressively mixes information across blocks in subsequent stages. We acknowledge that the original manuscript did not contain a dedicated ablation isolating this effect. The revised version adds a new analysis that compares standard RBDC against a controlled variant with explicit cross-block mixing; the accuracy gap on ImageNet is negligible, indicating that recursive coupling restores sufficient representational capacity for the observed performance. revision: yes
Referee: [Experiments / Efficiency comparison] Results on training budget: the reported FLOPs savings assume that no post-coupling fine-tuning is performed. If any recovery step is used, its cost must be added to the total budget; the manuscript does not state whether such a step occurs or how its cost is accounted for in the 30% figure.

Authors: Our protocol performs no post-coupling fine-tuning; the block-diagonal connections are parameter-free and the resulting model is evaluated directly. Consequently the 30% FLOPs figure accounts exclusively for the training cost of the narrower sub-models. The revised manuscript now states this explicitly in both the method description and the experimental protocol section, removing any ambiguity about budget accounting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical protocol evaluated against external baselines

full rationale

The paper presents RBDC as a parameter-free recursive block-diagonal coupling of independently trained narrower models, with efficiency claims (30% FLOPs reduction at matched accuracy, gains vs. growth baselines) resting on direct empirical measurements on ImageNet for DeiT and ResNet. No equations, fitted parameters renamed as predictions, self-definitional constructions, or load-bearing self-citations appear in the abstract or described protocol. The derivation chain consists of training sub-models separately then coupling them, with performance assessed via standard benchmarks rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no free parameters, axioms, or invented entities; the coupling is explicitly parameter-free and the results are empirical.

pith-pipeline@v0.9.0 · 5713 in / 1036 out tokens · 26863 ms · 2026-05-25T04:42:26.511784+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 3 internal anchors

[1]

Banner, R., Nahshan, Y., Soudry, D.: Post training 4-bit quantization of con- volutional networks for rapid-deployment. In: Adv. Neural Inf. Process. Syst. (NeurIPS). vol. 31, pp. 7950–7958. Curran Assoc. Inc., Vancouver, Can. (Dec 2018) 3

work page 2018
[2]

Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image trans- formers. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–18. Virtual conference (Apr 2022) 14

work page 2022
[3]

Net2Net: Accelerating Learning via Knowledge Transfer

Chen, T., Goodfellow, I., Shlens, J.: Net2Net: Accelerating learning via knowledge transfer. arXivabs/1511.05641(2015) 4, 5, 7, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2015
[4]

Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–18. Addis Ababa, Ethiopia (Apr 2020) 14

work page 2020
[5]

Evci, U., van Merriënboer, B., Unterthiner, T., Vladymyrov, M., Pedregosa, F.: GradMax: Growing neural networks using gradient information. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–17. Virtual conference (Sept 2022) 4, 7, 11, 12

work page 2022
[6]

com / facebookresearch/deit8

Facebook Research: Official DeiT repository,https : / / github . com / facebookresearch/deit8

work page
[7]

Gong, L., He, D., Li, Z., Qin, T., Wang, L., Liu, T.: Efficient training of BERT by progressively stacking. In: Int. Conf. Mach. Learn. (ICML). Proc. Mach. Learn. Res., vol. 97, pp. 2337–2346. ML Res. Press (Jun 2019) 4

work page 2019
[8]

Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. Int. J. Comput. Vis.129(6), 1789–1819 (Mar 2021) 3

work page 2021
[9]

Gusak, J., Cherniuk, D., Shilova, A., Katrutsa, A., Bershatsky, D., Zhao, X., Eyraud-Dubois, L., Shliazhko, O., Dimitrov, D., Oseledets, I., Beaumont, O.: Sur- vey on efficient training of large neural networks. In: Int. Jt. Conf. Artif. Intell. (IJCAI). pp. 5494–5501. Vienna, Austria (Jul 2022) 3

work page 2022
[10]

Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: Adv. Neural Inf. Process. Syst. (NeurIPS). pp. 1135–1143. Curran Assoc. Inc., Montréal, Can. (Dec 2015) 3

work page 2015
[11]

In: IEEE Int

He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: IEEE Int. Conf. Comput. Vis. (ICCV). pp. 2980–2988. IEEE, Venice, Italy (Oct 2017) 13, 14

work page 2017
[12]

arXivabs/2507.21573(2025) 3

Henry,M.,Deliège,A.,Cioppa,A.,VanDroogenbroeck,M.:LinDeps:Afine-tuning freepost-pruningmethodtoremovelayer-wiselineardependencieswithguaranteed performance preservation. arXivabs/2507.21573(2025) 3

work page arXiv 2025
[13]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G.v.d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models. In: Adv. Neural Inf....

work page 2022
[14]

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Ges- mundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: Int. Conf. Mach. Learn. (ICML). Proc. Mach. Learn. Res., vol. 97, pp. 2790–

work page
[15]

Press (Jun 2019) 3, 4

ML Res. Press (Jun 2019) 3, 4

work page 2019
[16]

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–13. Virtual conference (Sept 2022) 3, 4, 7 Resource-Efficient Training of Vision Models 17

work page 2022
[17]

IEEE/CAA J

Li, X., Duan, H., Tian, Y., Wang, F.Y.: Exploring image generation for UAV change detection. IEEE/CAA J. Autom. Sin.9(6), 1061–1072 (Jun 2022) 13

work page 2022
[18]

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Eur. Conf. Comput. Vis. (ECCV). Lect. Notes Comput. Sci., vol. 8693, pp. 740–755. Springer (Sept 2014) 3, 13, 14

work page 2014
[19]

Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–16. Toulon, France (Apr 2017) 8

work page 2017
[20]

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Int. Conf. Learn. Represent. (ICLR). New Orleans, LA, USA (May 2019) 8, 14

work page 2019
[21]

ACM Comput

Menghani, G.: Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Comput. Surv.55(12), 1–37 (Mar 2023) 3

work page 2023
[22]

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., Wu, H.: Mixed precision training. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–12. Vancouver, Can. (Apr 2018) 4

work page 2018
[23]

FP8 Formats for Deep Learning

Micikevicius, P., Stosic, D., Burgess, N., Cornea, M., Dubey, P., Grisenthwaite, R., Ha, S., Heinecke, A., Judd, P., Kamalu, J., Mellempudi, N., Oberman, S., Shoeybi, M., Siu, M., Wu, H.: FP8 formats for deep learning. arXivabs/2209.05433(2022) 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

In: IEEE/CVF Winter Conf

Pham, C., Teterwak, P., Nelson, S., Plummer, B.A.: MixtureGrowth: Growing neural networks by recombining learned parameters. In: IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV). pp. 2788–2797. IEEE, Waikoloa, HI, USA (Jan 2024) 5, 7, 8, 11, 12

work page 2024
[25]

arXivabs/2304.03589(2023) 3

Shen, L., Sun, Y., Yu, Z., Ding, L., Tian, X., Tao, D.: On efficient training of large- scale deep learning models: A literature review. arXivabs/2304.03589(2023) 3

work page arXiv 2023
[26]

sustainability-directory.com/term/resource-efficient-machine-learning (2026) 1

Sustainability Directory: Resource-efficient machine learning.https://climate. sustainability-directory.com/term/resource-efficient-machine-learning (2026) 1

work page 2026
[27]

Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE105(12), 2295–2329 (Dec 2017) 3

work page 2017
[28]

In: AAAI Conf

Thompson, N., Fleming, M., Tang, B.J., Pastwa, A.M., Borge, N., Goehring, B.C., Das, S.: A model for estimating the economic costs of computer vision systems that use deep learning. In: AAAI Conf. Artif. Intell. vol. 38, pp. 23012–23018. Assoc. Adv. Artif. Intell. (AAAI), Vancouver, Can. (Mar 2024) 1, 4

work page 2024
[29]

and Manso, G.F., 2020

Thompson, N.C., Greenewald, K., Lee, K., Manso, G.F.: The computational limits of deep learning. arXivabs/2007.05558(2020) 3

work page arXiv 2007
[30]

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Int. Conf. Mach. Learn. (ICML). Proc. Mach. Learn. Res., vol. 139, pp. 10347–10357. ML Res. Press, Virtual Conf. (Jul 2021) 7, 8, 13

work page 2021
[31]

Touvron, H., Cord, M., Jégou, H.: DeiT III: Revenge of the ViT. In: Eur. Conf. Comput. Vis. (ECCV). Lect. Notes Comput. Sci., vol. 13684, pp. 516–533. Springer Nat. Switz., Tel Aviv, Israël (2022) 4, 7, 9

work page 2022
[32]

Wightman, R.: Pytorch image models.https://github.com/rwightman/pytorch- image-models(2019) 12

work page 2019
[33]

Wu, L., Liu, B., Stone, P., Liu, Q.: Firefly neural architecture descent: a general ap- proach for growing neural networks. In: Adv. Neural Inf. Process. Syst. (NeurIPS). vol. 33, pp. 22373–22383. Curran Assoc. Inc., Virtual conference (Dec 2020) 11, 12

work page 2020
[34]

IEEE Trans

Xu, L., Xie, H., Qin, S.J., Tao, X., Wang, F.L.: Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. IEEE Trans. Pattern Anal. Mach. Intell. pp. 1–20 (2026) 3, 4 18 M. Henry et al

work page 2026
[35]

Wide Residual Networks

Zagoruyko, S., Komodakis, N.: Wide residual networks. arXivabs/1605.07146 (2016) 7 Resource-Efficient Training of Vision Models 19 A Supplementary Material This is the supplementary material for the paperRecursive Block-Diagonal Cou- pling for Resource-Efficient Training of Vision Models. A.1 Per-layer Implementation Details To merge two narrow vision mod...

work page internal anchor Pith review Pith/arXiv arXiv 2016

[1] [1]

Banner, R., Nahshan, Y., Soudry, D.: Post training 4-bit quantization of con- volutional networks for rapid-deployment. In: Adv. Neural Inf. Process. Syst. (NeurIPS). vol. 31, pp. 7950–7958. Curran Assoc. Inc., Vancouver, Can. (Dec 2018) 3

work page 2018

[2] [2]

Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image trans- formers. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–18. Virtual conference (Apr 2022) 14

work page 2022

[3] [3]

Net2Net: Accelerating Learning via Knowledge Transfer

Chen, T., Goodfellow, I., Shlens, J.: Net2Net: Accelerating learning via knowledge transfer. arXivabs/1511.05641(2015) 4, 5, 7, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2015

[4] [4]

Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–18. Addis Ababa, Ethiopia (Apr 2020) 14

work page 2020

[5] [5]

Evci, U., van Merriënboer, B., Unterthiner, T., Vladymyrov, M., Pedregosa, F.: GradMax: Growing neural networks using gradient information. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–17. Virtual conference (Sept 2022) 4, 7, 11, 12

work page 2022

[6] [6]

com / facebookresearch/deit8

Facebook Research: Official DeiT repository,https : / / github . com / facebookresearch/deit8

work page

[7] [7]

Gong, L., He, D., Li, Z., Qin, T., Wang, L., Liu, T.: Efficient training of BERT by progressively stacking. In: Int. Conf. Mach. Learn. (ICML). Proc. Mach. Learn. Res., vol. 97, pp. 2337–2346. ML Res. Press (Jun 2019) 4

work page 2019

[8] [8]

Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. Int. J. Comput. Vis.129(6), 1789–1819 (Mar 2021) 3

work page 2021

[9] [9]

Gusak, J., Cherniuk, D., Shilova, A., Katrutsa, A., Bershatsky, D., Zhao, X., Eyraud-Dubois, L., Shliazhko, O., Dimitrov, D., Oseledets, I., Beaumont, O.: Sur- vey on efficient training of large neural networks. In: Int. Jt. Conf. Artif. Intell. (IJCAI). pp. 5494–5501. Vienna, Austria (Jul 2022) 3

work page 2022

[10] [10]

Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: Adv. Neural Inf. Process. Syst. (NeurIPS). pp. 1135–1143. Curran Assoc. Inc., Montréal, Can. (Dec 2015) 3

work page 2015

[11] [11]

In: IEEE Int

He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: IEEE Int. Conf. Comput. Vis. (ICCV). pp. 2980–2988. IEEE, Venice, Italy (Oct 2017) 13, 14

work page 2017

[12] [12]

arXivabs/2507.21573(2025) 3

Henry,M.,Deliège,A.,Cioppa,A.,VanDroogenbroeck,M.:LinDeps:Afine-tuning freepost-pruningmethodtoremovelayer-wiselineardependencieswithguaranteed performance preservation. arXivabs/2507.21573(2025) 3

work page arXiv 2025

[13] [13]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G.v.d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models. In: Adv. Neural Inf....

work page 2022

[14] [14]

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Ges- mundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: Int. Conf. Mach. Learn. (ICML). Proc. Mach. Learn. Res., vol. 97, pp. 2790–

work page

[15] [15]

Press (Jun 2019) 3, 4

ML Res. Press (Jun 2019) 3, 4

work page 2019

[16] [16]

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–13. Virtual conference (Sept 2022) 3, 4, 7 Resource-Efficient Training of Vision Models 17

work page 2022

[17] [17]

IEEE/CAA J

Li, X., Duan, H., Tian, Y., Wang, F.Y.: Exploring image generation for UAV change detection. IEEE/CAA J. Autom. Sin.9(6), 1061–1072 (Jun 2022) 13

work page 2022

[18] [18]

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Eur. Conf. Comput. Vis. (ECCV). Lect. Notes Comput. Sci., vol. 8693, pp. 740–755. Springer (Sept 2014) 3, 13, 14

work page 2014

[19] [19]

Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–16. Toulon, France (Apr 2017) 8

work page 2017

[20] [20]

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Int. Conf. Learn. Represent. (ICLR). New Orleans, LA, USA (May 2019) 8, 14

work page 2019

[21] [21]

ACM Comput

Menghani, G.: Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Comput. Surv.55(12), 1–37 (Mar 2023) 3

work page 2023

[22] [22]

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., Wu, H.: Mixed precision training. In: Int. Conf. Learn. Represent. (ICLR). pp. 1–12. Vancouver, Can. (Apr 2018) 4

work page 2018

[23] [23]

FP8 Formats for Deep Learning

Micikevicius, P., Stosic, D., Burgess, N., Cornea, M., Dubey, P., Grisenthwaite, R., Ha, S., Heinecke, A., Judd, P., Kamalu, J., Mellempudi, N., Oberman, S., Shoeybi, M., Siu, M., Wu, H.: FP8 formats for deep learning. arXivabs/2209.05433(2022) 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[24] [24]

In: IEEE/CVF Winter Conf

Pham, C., Teterwak, P., Nelson, S., Plummer, B.A.: MixtureGrowth: Growing neural networks by recombining learned parameters. In: IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV). pp. 2788–2797. IEEE, Waikoloa, HI, USA (Jan 2024) 5, 7, 8, 11, 12

work page 2024

[25] [25]

arXivabs/2304.03589(2023) 3

Shen, L., Sun, Y., Yu, Z., Ding, L., Tian, X., Tao, D.: On efficient training of large- scale deep learning models: A literature review. arXivabs/2304.03589(2023) 3

work page arXiv 2023

[26] [26]

sustainability-directory.com/term/resource-efficient-machine-learning (2026) 1

Sustainability Directory: Resource-efficient machine learning.https://climate. sustainability-directory.com/term/resource-efficient-machine-learning (2026) 1

work page 2026

[27] [27]

Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE105(12), 2295–2329 (Dec 2017) 3

work page 2017

[28] [28]

In: AAAI Conf

Thompson, N., Fleming, M., Tang, B.J., Pastwa, A.M., Borge, N., Goehring, B.C., Das, S.: A model for estimating the economic costs of computer vision systems that use deep learning. In: AAAI Conf. Artif. Intell. vol. 38, pp. 23012–23018. Assoc. Adv. Artif. Intell. (AAAI), Vancouver, Can. (Mar 2024) 1, 4

work page 2024

[29] [29]

and Manso, G.F., 2020

Thompson, N.C., Greenewald, K., Lee, K., Manso, G.F.: The computational limits of deep learning. arXivabs/2007.05558(2020) 3

work page arXiv 2007

[30] [30]

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: Int. Conf. Mach. Learn. (ICML). Proc. Mach. Learn. Res., vol. 139, pp. 10347–10357. ML Res. Press, Virtual Conf. (Jul 2021) 7, 8, 13

work page 2021

[31] [31]

Touvron, H., Cord, M., Jégou, H.: DeiT III: Revenge of the ViT. In: Eur. Conf. Comput. Vis. (ECCV). Lect. Notes Comput. Sci., vol. 13684, pp. 516–533. Springer Nat. Switz., Tel Aviv, Israël (2022) 4, 7, 9

work page 2022

[32] [32]

Wightman, R.: Pytorch image models.https://github.com/rwightman/pytorch- image-models(2019) 12

work page 2019

[33] [33]

Wu, L., Liu, B., Stone, P., Liu, Q.: Firefly neural architecture descent: a general ap- proach for growing neural networks. In: Adv. Neural Inf. Process. Syst. (NeurIPS). vol. 33, pp. 22373–22383. Curran Assoc. Inc., Virtual conference (Dec 2020) 11, 12

work page 2020

[34] [34]

IEEE Trans

Xu, L., Xie, H., Qin, S.J., Tao, X., Wang, F.L.: Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. IEEE Trans. Pattern Anal. Mach. Intell. pp. 1–20 (2026) 3, 4 18 M. Henry et al

work page 2026

[35] [35]

Wide Residual Networks

Zagoruyko, S., Komodakis, N.: Wide residual networks. arXivabs/1605.07146 (2016) 7 Resource-Efficient Training of Vision Models 19 A Supplementary Material This is the supplementary material for the paperRecursive Block-Diagonal Cou- pling for Resource-Efficient Training of Vision Models. A.1 Per-layer Implementation Details To merge two narrow vision mod...

work page internal anchor Pith review Pith/arXiv arXiv 2016