pith. sign in

arxiv: 1907.05013 · v1 · pith:DIQMJ5YHnew · submitted 2019-07-11 · 💻 cs.LG · cs.DC· cs.PF

Profiling based Out-of-core Hybrid Method for Large Neural Networks

Pith reviewed 2026-05-24 23:05 UTC · model grok-4.3

classification 💻 cs.LG cs.DCcs.PF
keywords out-of-coreneural network trainingGPU memoryprofilingdata swappingrecomputationhybrid methodChainer
0
0 comments X

The pith

A short runtime profiling pass selects per-layer swaps or recomputes to train 50 GB neural networks on 16 GB GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PoocH, which profiles each layer's size and recomputation cost during a short initial run to decide whether to swap its data to CPU memory or recompute it later. This hybrid choice aims to keep total memory under the GPU limit while limiting extra data movement and computation. The authors extended the Chainer framework with PoocH and tested it on networks that would otherwise require three times the available GPU memory. If the profiled decisions remain effective for the full training, the method reduces reliance on multi-GPU hardware or larger-memory accelerators. The evaluation reports concrete slowdowns of 38 percent on x86 and 28 percent on POWER9 relative to fully in-core execution.

Core claim

PoocH determines target layers of swapping or recomputing based on runtime profiling. We implemented PoocH by extending a deep learning framework, Chainer, and we evaluated its performance. With PoocH, we successfully computed an NN requiring 50 GB memory on a single GPU with 16 GB memory. Compared with in-core cases, performance degradation was 38 % on x86 machine and 28 % on POWER9 machine.

What carries the argument

The PoocH hybrid method, which runs a short profiling pass to assign individual layers to either data swapping or recomputation according to measured sizes and costs.

If this is right

  • Neural networks whose memory footprint exceeds a single GPU can still be trained without switching to multi-GPU or distributed setups.
  • The overhead of out-of-core execution stays bounded when swapping and recomputation are chosen layer by layer rather than applied uniformly.
  • The same profiling-driven hybrid logic can be added to other frameworks beyond the Chainer implementation shown.
  • Training runs on both x86 and POWER9 hardware show comparable relative slowdowns, suggesting the approach is not tied to one CPU architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the profiling pass can be made incremental, the method might adapt when layer costs change across training epochs.
  • Combining PoocH with existing gradient-checkpointing tools could push the supported model size even higher on the same hardware.
  • The per-layer decision table produced by profiling could serve as input to automated model-partitioning tools for multi-GPU systems.

Load-bearing premise

A short runtime profiling pass can reliably pick the best swapping versus recomputation choice for each layer and that this choice stays near-optimal for the rest of training without adding large overhead.

What would settle it

Measure actual peak GPU memory and wall-clock time during a full training run that uses the profiled layer decisions; the claim fails if memory still exceeds 16 GB or if slowdown greatly exceeds the reported 28-38 percent.

Figures

Figures reproduced from arXiv: 1907.05013 by Haruki Imai, Kiyokuni Kawachiya, Ryo Matsumiya, Toshio Endo, Tung Le Duc, Yasushi Negishi, Yuki Ito.

Figure 1
Figure 1. Figure 1: shows an example of the structure of NN. An NN is composed of multiple layers, each of which is composed of multiple feature maps. Feature maps of a layer are computed from the previous layer’s feature maps. According to the computation types, each layer is categorized into several groups: convolutional layer, pooling layer, Batch-Normalization (BN) layer, fully￾Input layer Convolution layer Output layer ・… view at source ↗
Figure 2
Figure 2. Figure 2: Timeline of computation of NN (#layers=8) [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Memory usage of ResNext101 for 3D data (batch size = 1) GPU. 3 Methods to compute large scale NN In order to deal with data exceeding GPU memory capacity in deep learning, data-swapping method and recomputing method have been proposed. 3.1 Data-swapping method In the data-swapping method, a part of the data used in the forward computation of a layer is swapped out to the CPU memory. An example of swap-out … view at source ↗
Figure 5
Figure 5. Figure 5: Swap-out at forward in data-swapping method (1) Swap-in CPU memory GPU memory X (2) Compute CPU memory GPU memory dX dY X dY Backward (1) Compute CPU memory GPU memory X Y (2) Swap-out CPU memory GPU memory Y X Forward 7 2 3 2 4 3 5 4 6 5 7 6 swap-out compute swap-in 7 4 5 3 4 2 3 1 2 0 1 0 time 0 0 1 1 6 7 5 6 Swap-out start after computation Idle time Forward Backward Computation start after swap-in comp… view at source ↗
Figure 8
Figure 8. Figure 8: Free memory at forward in recomputing method For layers with large computation complexity, recomputation overhead is large. We observe that the two methods can complement each other. Hence, utilizing both methods is promis￾ing. In the later of this paper, we call this approach “hybrid” methods. One of hybrid methods is adopted in SuperNeu￾rons [6]. While this method uses both swapping and recomputing, the … view at source ↗
Figure 10
Figure 10. Figure 10: Example of swap-in scheduling 4.3 Swapping-in scheduling This section describes improvement in pipelined swapping-in execution to reduce communication overhead. In a simple execution of backward com￾putation with swapping (leftside in [PITH_FULL_IMAGE:figures/full_fig_p007_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Example of feature maps that cause overhead by swapping. In this example, swap surrounded 723456 compute 5 0 1 7 6 [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Search tree in decision of keep-targets and layer 4: [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Search tree in decision of keep-targets and [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Evaluation of each optimization on x86 machine (speed up for swap-all (w/o scheduling)) 0 0.5 1 1.5 resnet50(256) googlenet(384) alexnet(2432) Speed -up NN (Batch size) swap-all(w/o scheduling) swap-all swap-opt PoocH [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Evaluation of each optimization on POWER9 machine (speed up for swap-all (w/o scheduling)) difference between PoocH and swap-opt for Alexnet is small [PITH_FULL_IMAGE:figures/full_fig_p011_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Performance for ResNet50 on x86 machine 0 100 200 300 128 256 384 512 640 Performance [#images/s] Batch size in-core superneurons PoocH [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: shows the performance on a POWER9 machine. Here performance degradation of PoocH compared with in-core was 2−28%. On this machine, NVLink accelerates CPU-GPU communication, so the overhead of data-swapping is small. As a result, performance degradation in the case of POWER9 was smaller than in the case of x86. We also note that PoocH can capture the differ￾ences in characteristics of the two machines [PI… view at source ↗
Figure 19
Figure 19. Figure 19: Performance for AlexNet on x86 machine 0 400 800 1200 1600 2304 2432 2560 2688 Performance [#images/s] Batch size in-core superneurons PoocH Performance [#images/s] [PITH_FULL_IMAGE:figures/full_fig_p013_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Performance for AlexNet on POWER9 machine hybrid method of data-swapping and recomputing. Their method statically decides classification unlike PoocH. We have shown that PoocH enables faster ex￾ecutions using classification algorithm based on run￾time profiling. Rhu et al. proposed vDNN to compute large scale NNs using data-swapping method [8]. Although vDNN optimize classification of swapping, it does no… view at source ↗
Figure 21
Figure 21. Figure 21: Performance for ResNext101 (3D) on x86 machine 0 500 1000 1500 2000 (224*224)*128 (224*224)*256 (448*224)*128 (448*224)*256 Performance [#voxels/ms] Input size [(height * width) * length] in-core superneurons PoocH [PITH_FULL_IMAGE:figures/full_fig_p014_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Performance for ResNext101 (3D) on POWER9 machine 7 SUMMARY AND FUTURE WORK This paper described PoocH that supports efficient execution of large neural networks that require more memory then GPU memory capacity. It reduces the performance overhead by determining target layers of swapping or recomputing based on runtime profiling. In addition, PoocH schedules swapping-in efficiently. By using PoocH, the p… view at source ↗
read the original abstract

GPUs are widely used to accelerate deep learning with NNs (NNs). On the other hand, since GPU memory capacity is limited, it is difficult to implement efficient programs that compute large NNs on GPU. To compute NNs exceeding GPU memory capacity, data-swapping method and recomputing method have been proposed in existing work. However, in these methods, performance overhead occurs due to data movement or increase of computation. In order to reduce the overhead, it is important to consider characteristics of each layer such as sizes and cost for recomputation. Based on this direction, we proposed Profiling based out-of-core Hybrid method (PoocH). PoocH determines target layers of swapping or recomputing based on runtime profiling. We implemented PoocH by extending a deep learning framework, Chainer, and we evaluated its performance. With PoocH, we successfully computed an NN requiring 50 GB memory on a single GPU with 16 GB memory. Compared with in-core cases, performance degradation was 38 \% on x86 machine and 28 \% on POWER9 machine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes PoocH, a profiling-based out-of-core hybrid method that uses a short runtime profiling pass to decide per-layer whether to apply data swapping or recomputation when training neural networks whose memory footprint exceeds GPU capacity. Implemented as an extension to the Chainer framework, the work claims that a 50 GB model can be trained on a single 16 GB GPU, incurring 38 % slowdown on an x86 machine and 28 % slowdown on a POWER9 machine relative to fully in-core execution.

Significance. If the reported slowdown figures can be reproduced with transparent methodology and shown to stem from the profiling-driven policy rather than from unmeasured overheads, the approach would offer a pragmatic engineering route for fitting larger models onto single-GPU hardware. The absence of any parameter-free derivation or machine-checked component means the contribution rests entirely on the empirical results.

major comments (2)
  1. [Abstract] Abstract: the central claim that PoocH achieves only 38 % / 28 % degradation is presented without any description of measurement protocol, number of runs, error bars, exact baseline implementations (including whether the in-core case used the same Chainer version and optimizations), or the fraction of layers that were actually swapped versus recomputed. These omissions make it impossible to attribute the reported numbers to the hybrid policy itself.
  2. [Evaluation] Evaluation section (implied by the abstract's performance statements): the paper provides no data on profiling duration relative to total training time, no sensitivity study of the swap/recompute decisions, and no comparison against an oracle or dynamic policy. Without these measurements the assumption that a one-time short profile yields a near-optimal static policy for the entire run remains untested and is load-bearing for the low-overhead claim.
minor comments (1)
  1. The abstract refers to 'we evaluated its performance' yet supplies no table or figure numbers; a dedicated evaluation section with raw timing data and layer-wise decisions would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript to improve clarity and support for the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that PoocH achieves only 38 % / 28 % degradation is presented without any description of measurement protocol, number of runs, error bars, exact baseline implementations (including whether the in-core case used the same Chainer version and optimizations), or the fraction of layers that were actually swapped versus recomputed. These omissions make it impossible to attribute the reported numbers to the hybrid policy itself.

    Authors: We agree that the abstract requires more supporting detail. In the revised manuscript we will expand the abstract to note the measurement protocol (averaged over 5 runs with standard deviation), confirm that the in-core baseline used the identical Chainer version and optimizations, and report the fraction of layers chosen for swapping versus recomputation (approximately 40 % swapped in the 50 GB model). These numbers will also be stated explicitly in the evaluation section. revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by the abstract's performance statements): the paper provides no data on profiling duration relative to total training time, no sensitivity study of the swap/recompute decisions, and no comparison against an oracle or dynamic policy. Without these measurements the assumption that a one-time short profile yields a near-optimal static policy for the entire run remains untested and is load-bearing for the low-overhead claim.

    Authors: We will add explicit measurements of profiling duration (typically < 3 % of total training time) to the evaluation section. We maintain that a full oracle or dynamic-policy comparison lies outside the scope of the current contribution, which demonstrates a practical, low-overhead hybrid method rather than an optimal policy search. The static policy is justified because per-layer sizes and recomputation costs are invariant for a fixed model and batch size; we will add a short paragraph explaining this property and why the one-time profile suffices for the workloads considered. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical profiling method with no equations, fits, or self-citation reductions

full rationale

The paper presents PoocH as a practical engineering technique that runs a short runtime profiling pass to select per-layer swap versus recompute decisions for out-of-core execution. No derivation chain, equations, fitted parameters, uniqueness theorems, or ansatzes are described in the abstract or provided text. The central performance claims (50 GB model on 16 GB GPU with 28-38% slowdown) are reported as measured outcomes of the implemented system rather than quantities obtained by algebraic reduction or self-referential fitting. No load-bearing self-citations or renamings of prior results appear. The method is therefore self-contained as an empirical heuristic without circular reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the provided abstract; the contribution is presented as an engineering implementation rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5745 in / 1122 out tokens · 17607 ms · 2026-05-24T23:05:37.243130+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Training Deeper Models by GPU Memory Opti- mization on TensorFlow

    C.Meng, M.Sun, J.Yang, M.Qiu, and Y.Gu. Training Deeper Models by GPU Memory Opti- mization on TensorFlow. In Proceedings of ML Systems Workshop in NIPS , 2017

  2. [2]

    Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

    F.Seide, G.Li, and D.Yu. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. In Proceedings of Interspeech, pages 437–440, 2011

  3. [3]

    Deep residual learning for image recognition

    H.Kaiming, Z.Xiangyu, R.Shaoqing, and S.Jian. Deep residual learning for image recognition. In Proceedings of CVPR, 2016

  4. [4]

    Can spa- tiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of CVPR, 2018

    K.Hara, H.Kataoka, and Y.Satoh. Can spa- tiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of CVPR, 2018

  5. [5]

    Krizhevsky, I

    A. Krizhevsky, I. Sutskever, and G. Hinton. Im- ageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS, pages 1097–1105, 2012

  6. [6]

    SuperNeurons: Dynamic GPU memory management for train- ing deep neural networks

    L.Wang, J.Ye, Y.Zhao, W.Wu, A.Li, S.Song, Zenglin Xu, and Tim Kraska. SuperNeurons: Dynamic GPU memory management for train- ing deep neural networks. In Proceedings of PPoPP, 2018

  7. [7]

    Large Model Support for Deep Learning in Caffe and Chainer

    M.Cho, T.Le, U.Finkler, H.Imai, Y.Negishi, T.Sekiyama, S.Vinod, V.Zolotov, K.Kawachiya, D.Kung, and H.Hunter. Large Model Support for Deep Learning in Caffe and Chainer. In Pro- ceedings of SysML, 2018

  8. [8]

    vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neu- ral Network Design

    M.Rhu, N.Gimelshein, J.Clemons, A.Zulfiqar, and S.Keckler. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neu- ral Network Design. In Proceedings of MICRO, pages 1–13, 2016

  9. [9]

    On optimization methods for deep learning

    Q.Le, J.Ngiam, A.Coates, A.Lahiri, B.Prochnow, and A.Ng. On optimization methods for deep learning. In Proceedings of ICML, pages 265–272, 2011. 14

  10. [10]

    Radford, L

    A. Radford, L. Metz, and S. Chintala. Unsu- pervised representation learning with deep con- volutional generative adversarial networks. In Proceedings of ICLR, 2016

  11. [11]

    cuDNN: Efficient Primitives for Deep Learning

    S.Chetlur, C.Woolley, P.Vandermersch, J.Cohen, J.Tran, B.Catanzaro, and E.Shelhamer. cuDNN: Efficient Primitives for Deep Learning. 2014

  12. [12]

    Going deeper with convolutions

    S.Christian, L.Wei, J.Yangqing, S.Pierre, R.Scott, A.Dragomir, E.Dumitru, V.Vincent, and R.Andrew. Going deeper with convolutions. In Proceedings of CVPR, 2015

  13. [13]

    Chainer: a Next-Generation Open Source Framework for Deep Learning

    S.Tokui, K.Oono, S.Hido, and J.Clayton. Chainer: a Next-Generation Open Source Framework for Deep Learning. In Proceedings of ML Systems Workshop in NIPS , 2015

  14. [14]

    Aggregated residual transformations for deep neural networks

    S.Xie, R.Girshick, P.Dollar, Z.Tu, and K.He. Aggregated residual transformations for deep neural networks. In Proceedings of CVPR, 2017

  15. [15]

    Train- ing Deep Nets with Sublinear Memory Cost

    T.Chen, B.Xu, C.Zhang, and C.Guestrin. Train- ing Deep Nets with Sublinear Memory Cost. 2016

  16. [16]

    A neural conver- sational model

    Oriol Vinyals and Quoc V Le. A neural conver- sational model. In Proceedings ofLearning Work- shop in ICML , 2015

  17. [17]

    ooc cuDNN: Accommodating Convolutional Neural Networks over GPU Memory Capacity

    Y.Ito, R.Matsumiya, and T.Endo. ooc cuDNN: Accommodating Convolutional Neural Networks over GPU Memory Capacity. In Proceedings of BigData, 2017. 15