
arxiv: 2605.19533 · v1 · pith:BDOOUN6Inew · submitted 2026-05-19 · 💻 cs.CV

Replacement Learning: Training Neural Networks with Fewer Parameters

Pith reviewed 2026-05-20 06:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords replacement learning · parameter reduction · efficient training · surrogate operators · neural networks · CNN · vision transformers · backpropagation

The pith

Replacement Learning trains neural networks more efficiently by replacing selected blocks with lightweight surrogate operators synthesized from adjacent blocks' parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Replacement Learning as a training paradigm that replaces some neural network blocks with lightweight alternatives to reduce redundancy in deep models. For each replaced block, a learnable layer creates a surrogate operation by transforming parameters from the blocks immediately before and after it, then applies this to the incoming activation. This avoids full computation and backpropagation through the original block while aiming to keep the contextual information flowing. Experiments demonstrate reductions in parameters, memory, and time, with accuracy matching or exceeding standard training on datasets like CIFAR-10, ImageNet, and tasks including detection and segmentation. The method is shown to work for both convolutional networks and vision transformers, with additional tests confirming compatibility with other techniques.

Core claim

Replacement Learning (RepL) reduces full-depth redundancy in neural network training by replacing selected blocks with a lightweight computing layer. This layer synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation and applies the synthesized operator to the preceding activation. In this manner, RepL preserves local contextual continuity without requiring the full-layer computation and differentiation of the replaced block. Tailored parameter-fusion blocks are used for CNNs and ViTs to handle their specific structures. This leads to fewer trainable parameters, lower GPU memory usage, and shorter training times, while matching or surpassing standard end-to-end training.
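
The mechanism in the core claim can be illustrated with a toy sketch. It assumes that blocks are plain square linear maps and that the learnable transformation is a two-scalar blend of the neighbouring weights. Both are simplifications chosen for illustration; the function names, the blend form, and the weights are hypothetical and are not taken from the paper, whose fusion blocks for CNNs and ViTs are more elaborate.

```python
# Minimal sketch of a parameter-fusion surrogate (illustrative only).
# ASSUMPTION: blocks are plain square linear maps and the "learnable
# transformation" is a two-scalar blend; this is not the paper's design.

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def synthesize_surrogate(w_prev, w_next, alpha, beta):
    """Blend the neighbouring blocks' weights into a surrogate operator."""
    return [
        [alpha * p + beta * n for p, n in zip(row_p, row_n)]
        for row_p, row_n in zip(w_prev, w_next)
    ]

def replaced_block_forward(h_prev, w_prev, w_next, alpha, beta):
    """Apply the synthesized operator to the preceding activation."""
    w_hat = synthesize_surrogate(w_prev, w_next, alpha, beta)
    return matvec(w_hat, h_prev)

# Hypothetical 2x2 weights of the blocks before and after the replaced one.
w_prev = [[1.0, 0.0], [0.0, 1.0]]
w_next = [[0.0, 2.0], [2.0, 0.0]]
h_prev = [1.0, 3.0]       # activation leaving the preceding block
alpha, beta = 0.5, 0.5    # the only new trainable values in this toy layer

h_out = replaced_block_forward(h_prev, w_prev, w_next, alpha, beta)

# Parameter accounting for the toy case: a full 2x2 block would train
# four weights, while the surrogate layer adds only alpha and beta.
trainable_full_block = 2 * 2
trainable_surrogate = 2
```

In this toy setting the replaced block contributes no weights of its own; its output is computed entirely from the neighbours' parameters and the two blend coefficients, which is the sense in which the surrogate is lightweight.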

What carries the argument

The learnable transformation in parameter-fusion blocks that synthesizes a surrogate operator from the parameters of adjacent preceding and succeeding blocks.

Load-bearing premise

A learnable transformation synthesizing a surrogate operator from the parameters of adjacent preceding and succeeding blocks can preserve local contextual continuity and overall representation capacity without requiring full-depth backpropagation through the replaced block.

What would settle it

If a RepL-trained model on ImageNet achieves substantially lower top-1 accuracy than a standard end-to-end trained model under identical conditions, the performance-matching claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.19533 by Dongzhi Guan, Hengyu Shi, Jiabin Liu, Jiaji Wang, Junhao Su, Peizhe Wang, Tianyang Han, Yuming Zhang.

Figure 1: Comparison between different backbones with Replacement Learning and End-to-End training regarding GPU memory and test accuracy. The diameter …
Figure 2: Comparison of (a) End-to-End training and (b) our proposed Replacement Learning.
Figure 3: Visualization of feature maps. (a) Feature map of ResNet-32 with End-to-End training. (b) Feature map of ResNet-32 with Replacement Learning.
Figure 4: t-SNE visualization. (a) is t-SNE of E2E training, and (b) is t-SNE …
Original abstract

End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Replacement Learning (RepL), a training-time paradigm for deep neural networks that replaces selected blocks with lightweight layers synthesizing surrogate operators from the parameters of adjacent preceding and succeeding blocks via a learnable transformation. This is claimed to reduce trainable parameters, GPU memory, and training time while preserving local contextual continuity and achieving performance that matches or surpasses standard end-to-end training. The approach is instantiated for CNNs and ViTs and evaluated on classification (CIFAR-10, SVHN, STL-10, ImageNet), detection (COCO), and segmentation (CityScapes), with additional results on transfer learning, quantization, and other techniques.

Significance. If the central empirical claims hold under rigorous controls, RepL could offer a practical route to more efficient deep-network training by mitigating full-depth backpropagation redundancy without explicit layer pruning or reuse architectures. The compatibility with ViTs, stochastic depth, and INT8 quantization would strengthen its generality for modern vision pipelines.

major comments (3)
  1. [Method] Method section: The core assumption that a lightweight learnable fusion of parameters from only the preceding and succeeding blocks can produce an activation functionally interchangeable with the removed block's distinct nonlinear mapping (e.g., specific channel mixing or attention patterns) is load-bearing for the 'matching or surpassing' performance guarantee, yet no theoretical recovery bound or controlled ablation isolating the synthesis mechanism from simple depth reduction is provided.
  2. [Experiments] Experimental results: Claims of reduced parameters, memory, and time with parity or better accuracy rest on unspecified baselines, number of runs, statistical tests, and controls for the replacement choice; without these, it is impossible to rule out that observed gains arise from effective network thinning rather than the surrogate operator.
  3. [Table 2] Table reporting main results (e.g., ImageNet or COCO rows): The absence of variance across seeds or explicit comparison to depth-matched pruned baselines makes it difficult to assess whether the reported improvements are robust or merely consistent with reduced effective depth.
minor comments (2)
  1. [§3.2] Notation for the learnable transformation parameters could be unified across CNN and ViT instantiations to avoid reader confusion when comparing fusion blocks.
  2. [Figure 2] Figure illustrating the replacement block would benefit from explicit arrows showing gradient flow during the synthesis step.

Simulated Author's Rebuttal

3 responses · 1 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment in detail below and have made revisions to the manuscript to strengthen the presentation of our results and methods.

Point-by-point responses
  1. Referee: [Method] Method section: The core assumption that a lightweight learnable fusion of parameters from only the preceding and succeeding blocks can produce an activation functionally interchangeable with the removed block's distinct nonlinear mapping (e.g., specific channel mixing or attention patterns) is load-bearing for the 'matching or surpassing' performance guarantee, yet no theoretical recovery bound or controlled ablation isolating the synthesis mechanism from simple depth reduction is provided.

    Authors: We acknowledge that a formal theoretical recovery bound would provide stronger guarantees. However, establishing such a bound for arbitrary nonlinear mappings in deep networks is beyond the scope of this work and remains an open challenge in the field. To address the concern empirically, we have added a controlled ablation study (new Section 4.3) that compares RepL against networks where blocks are simply removed without the surrogate fusion (i.e., depth reduction). The results demonstrate that the learnable parameter-fusion surrogate contributes measurably to performance preservation, beyond what depth reduction alone achieves. We have also clarified the design choices for the fusion blocks in the revised Method section. revision: partial

  2. Referee: [Experiments] Experimental results: Claims of reduced parameters, memory, and time with parity or better accuracy rest on unspecified baselines, number of runs, statistical tests, and controls for the replacement choice; without these, it is impossible to rule out that observed gains arise from effective network thinning rather than the surrogate operator.

    Authors: We thank the referee for pointing this out. In the revised manuscript, we have explicitly stated the baselines as standard end-to-end training on identical architectures. We now report results averaged over 5 independent runs with different random seeds, include p-values from paired t-tests to assess statistical significance, and provide controls by experimenting with different replacement strategies (e.g., replacing every other block vs. specific stages). To further rule out thinning effects, we added comparisons to equivalent-parameter pruned models in the experiments section. revision: yes

  3. Referee: [Table 2] Table reporting main results (e.g., ImageNet or COCO rows): The absence of variance across seeds or explicit comparison to depth-matched pruned baselines makes it difficult to assess whether the reported improvements are robust or merely consistent with reduced effective depth.

    Authors: We have updated Table 2 to report mean performance with standard deviations across the 5 seeds. Furthermore, we introduced a new comparison table (Table 3) that includes depth-matched pruned baselines, where layers are removed to achieve similar parameter counts and effective depth as our RepL models. These pruned baselines underperform RepL, indicating that the surrogate operators provide benefits not attributable to depth reduction alone. revision: yes
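
The reporting protocol described in the responses above (mean and standard deviation over 5 seeds, plus a paired t-test between RepL and end-to-end training) can be sketched as follows. The accuracy values are placeholders invented purely to make the example run; they are not results from the paper, and the helper name `paired_t_statistic` is hypothetical.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# PLACEHOLDER accuracies for 5 seeds; NOT results from the paper.
repl_acc = [93.1, 93.4, 92.9, 93.3, 93.0]
e2e_acc = [92.8, 93.0, 92.9, 92.9, 92.7]

t_stat, dof = paired_t_statistic(repl_acc, e2e_acc)

# Mean and standard deviation per method, as a results table would report.
repl_summary = (statistics.mean(repl_acc), statistics.stdev(repl_acc))
e2e_summary = (statistics.mean(e2e_acc), statistics.stdev(e2e_acc))
```

Turning the statistic into a p-value requires the Student t distribution, which the Python standard library does not provide; `scipy.stats.ttest_rel` computes both the statistic and the p-value directly. Pairing by seed only makes sense if each seed is shared between the two training regimes.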

standing simulated objections not resolved
  • Providing a theoretical recovery bound for the parameter synthesis mechanism

Circularity Check

0 steps flagged

No circularity: RepL is a constructively defined new training paradigm validated empirically

full rationale

The paper introduces Replacement Learning (RepL) as a novel training-time paradigm that replaces selected blocks with a lightweight learnable fusion layer synthesizing a surrogate operator from adjacent block parameters. This is a direct architectural definition and training procedure rather than a mathematical derivation that reduces to its own inputs by construction. Claims of parameter reduction and performance parity are supported by extensive experiments across multiple datasets and tasks (CIFAR-10, ImageNet, COCO, etc.), without invoking self-citations for load-bearing uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled from prior author work. The central mechanism is independently specified and evaluated, making the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method assumes correlated learning patterns between neighboring layers can be captured by a parameter-fusion block; this is an ad-hoc modeling choice introduced to enable the replacement without full computation.

free parameters (1)
  • learnable transformation parameters
    Parameters of the lightweight computing layer that synthesize the surrogate operator from adjacent blocks.
axioms (1)
  • domain assumption: Neighboring layers exhibit highly correlated learning patterns that can be exploited by a surrogate operator.
    Invoked in the motivation for replacing blocks rather than discarding them.


