Training Deep Nets with Sublinear Memory Cost
Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3
The pith
An algorithm trains an n-layer deep network using O(sqrt(n)) memory at the cost of one extra forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation.
What carries the argument
The checkpointing strategy that segments the computation graph into sqrt(n) intervals, storing activations only at boundaries and recomputing forwards inside each interval during backpropagation.
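The segment-and-recompute idea can be sketched on a toy chain of scalar layers. This is a minimal illustration, not the paper's implementation: the layer form y = tanh(w * x), the loss (the final output itself), and the names forward_layer, backward_layer, grads_full_storage, and grads_checkpointed are all assumptions made for the example. It stores one activation per segment of length about sqrt(n), recomputes each segment's inputs during the backward sweep, and reports how many activations were held at the peak.

```python
import math


def forward_layer(w, x):
    """One toy layer (assumed form): y = tanh(w * x)."""
    return math.tanh(w * x)


def backward_layer(w, x, grad_y):
    """Given the layer input x and dL/dy, return (dL/dx, dL/dw)."""
    y = math.tanh(w * x)
    local = (1.0 - y * y) * grad_y  # tanh'(w*x) times the upstream gradient
    return local * w, local * x


def grads_full_storage(weights, x0):
    """Baseline backprop: keep the input of every layer (n activations)."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0  # the loss is taken to be the final output
    grad_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grad_w[i] = backward_layer(weights[i], inputs[i], grad)
    return grad_w, len(inputs)


def grads_checkpointed(weights, x0):
    """Backprop that stores only segment-boundary activations."""
    n = len(weights)
    seg = max(1, math.isqrt(n))  # segment length of roughly sqrt(n)

    # Forward pass: keep only the input of each segment.
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % seg == 0:
            checkpoints[i] = x
        x = forward_layer(w, x)

    grad = 1.0
    grad_w = [0.0] * n
    peak = 0
    # Backward pass: walk the segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + seg, n)
        # Recompute the inputs of the layers inside this segment.
        inputs = []
        x = checkpoints[start]
        for i in range(start, end):
            inputs.append(x)
            x = forward_layer(weights[i], x)
        # Count boundary activations plus this segment's temporary ones.
        peak = max(peak, len(checkpoints) + len(inputs))
        for i in reversed(range(start, end)):
            grad, grad_w[i] = backward_layer(weights[i], inputs[i - start], grad)
    return grad_w, peak
```

With 100 layers, the baseline holds 100 activations while the checkpointed version holds 20 at its peak (10 boundaries plus 10 recomputed inputs), and both return the same gradients. The recomputation inside the segments adds up to roughly one additional forward pass over the chain.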
If this is right
- A 1000-layer residual network trains with memory reduced from 48G to 7G and only 30 percent extra running time on ImageNet.
- Complex recurrent neural networks become trainable on very long sequences with substantially lower memory.
- State-of-the-art models no longer hit GPU memory limits as quickly, enabling exploration of deeper architectures.
- An extreme variant reduces memory to O(log n) at the cost of O(n log n) extra forward computation.
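The O(log n) variant in the last bullet can be illustrated with a cost counter for recursive bisection. This is a sketch under assumptions, not the paper's exact recursion: it assumes each span is split in half, that one midpoint activation is held while the right half is processed and released before the left half, and that every layer costs one unit. The function name recursive_cost is hypothetical.

```python
def recursive_cost(n):
    """Return (forward_evals, peak_stored) for bisection-style recomputation.

    forward_evals counts layer evaluations spent reaching midpoints;
    peak_stored counts activations held at once, including the network input.
    """

    def visit(length):
        if length <= 1:
            # A single layer is backpropagated directly from its stored input.
            return 0, 0
        left = length // 2
        right = length - left
        # Run forward through the left half to obtain the midpoint activation,
        # then handle the right half while the midpoint is held.
        f_right, p_right = visit(right)
        # The midpoint is released before the left half is handled.
        f_left, p_left = visit(left)
        forwards = left + f_right + f_left
        peak = max(1 + p_right, p_left)
        return forwards, peak

    forwards, peak = visit(n)
    return forwards, peak + 1  # +1 for the input to the whole chain
```

For a power-of-two depth, the counter follows T(n) = n/2 + 2 T(n/2), giving (n/2) log2(n) forward evaluations, and P(n) = 1 + P(n/2), giving log2(n) + 1 stored activations. Calling recursive_cost(1024) returns (5120, 11).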
Where Pith is reading between the lines
- The approach could lower hardware barriers for training large models and make advanced deep learning more accessible on modest GPUs.
- Adaptive checkpoint intervals based on per-layer compute cost might improve the compute-memory trade-off further.
- The method pairs naturally with model parallelism to scale to even larger networks without changing the core algorithm.
- Systems with high compute throughput relative to memory bandwidth would see the smallest effective overhead from the extra forward passes.
Load-bearing premise
The computation graph can be cleanly segmented into sqrt(n) intervals where recomputing forward passes inside each interval is both correct and cheaper than storing all intermediate activations.
What would settle it
Running the algorithm on a 1000-layer residual network and measuring whether peak memory usage scales as O(sqrt(n)), total runtime increases by about 30 percent, and the resulting gradients match those from full-storage training.
Original abstract
We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an algorithm to train deep neural networks with O(sqrt(n)) memory cost for an n-layer network, incurring only the cost of one extra forward pass per mini-batch. This is achieved through computation graph analysis, segmenting the network into intervals, storing boundary activations, and recomputing forward passes within segments during backpropagation. The approach is extended to O(log n) memory with O(n log n) extra computation, and validated on ImageNet with a 1000-layer ResNet (48G to 7G memory) and long-sequence RNNs.
Significance. If the claims hold, this is a significant contribution to deep learning training efficiency, allowing exploration of deeper models on memory-constrained hardware like GPUs. The systematic use of DAG properties for memory optimization, combined with empirical validation showing memory reduction with modest time overhead and correct gradients, provides a practical tool for advancing DL research. The parameter-free derivation from standard graph segmentation is a strength.
minor comments (2)
- [Abstract] The O(sqrt(n)) claim would be clearer if it explicitly stated the segmentation assumption (clean intervals where recomputation is correct and cheaper than storing all activations) that underpins the bound.
- [Experiments] The 30% extra time cost for the 1000-layer ResNet is reported, but a per-component breakdown (recomputation vs. original forward/backward) would make the compute-memory trade-off more transparent.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, the assessment of its significance, and the recommendation to accept the manuscript. No major comments requiring response or revision were raised.
Point-by-point responses
- Referee: No specific major comments were listed in the report.
  Authors: We appreciate the referee's recognition that the algorithm provides a systematic, parameter-free approach to memory reduction via graph segmentation and recomputation, with empirical validation on large models. The description of the O(sqrt(n)) memory bound, the O(log n) extension, and the ImageNet/ResNet and RNN experiments matches our claims exactly.
  Revision: no
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The O(sqrt(n)) memory bound is obtained by partitioning the n-layer computation DAG into sqrt(n) segments, retaining only the sqrt(n) boundary activations, and performing one recomputation of each segment during back-propagation; the total extra work equals one forward pass by direct operation counting on the graph. This counting argument relies only on standard properties of feed-forward and recurrent DAGs plus the in-place/memory-sharing optimizations described in the paper; no parameters are fitted to data, no result is defined in terms of itself, and no load-bearing step reduces to a self-citation. The reported ImageNet and RNN experiments serve as empirical confirmation rather than definitional inputs.
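The counting argument can be written out as a short worked derivation. It is a sketch under simplifying assumptions that are not stated in this form above: every layer has unit compute cost and unit activation size, and k denotes the number of equal segments (a symbol introduced here for illustration).

```latex
% Assumptions: n layers, k equal segments, unit cost and unit size per layer.
\begin{align*}
  M(k) &= \underbrace{k}_{\text{boundary activations}}
        + \underbrace{\tfrac{n}{k}}_{\text{one segment recomputed at a time}}, \\
  \frac{dM}{dk} &= 1 - \frac{n}{k^{2}} = 0
        \quad\Longrightarrow\quad k = \sqrt{n}, \\
  M(\sqrt{n}) &= 2\sqrt{n} = O(\sqrt{n}), \\
  \text{extra forward work} &= \sum_{j=1}^{k} \frac{n}{k} = n
        \quad\text{(one additional forward pass)}.
\end{align*}
```

Under these assumptions the choice k = sqrt(n) balances the two memory terms, and each layer is recomputed exactly once, which matches the one-extra-forward-pass cost quoted in the rationale.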
Axiom & Free-Parameter Ledger
axioms (1)
- Standard math: The forward computation graph is a directed acyclic graph whose nodes correspond to layer activations.
Forward citations
Cited by 60 Pith papers
-
Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion
MIRAGE discovers semantic attacks on online HD map construction via conditional diffusion, enabling boundary removal and injection that degrade AV performance while passing as realistic environmental changes.
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
-
Efficient and provably convergent end-to-end training of deep neural networks with linear constraints
An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.
-
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...
-
Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance
Data-guided finite-volume PINNs for 2D shallow water equations avoid trivial low-momentum collapse via sparse measurements, achieving up to 22x error reduction on benchmarks and accurate surrogates on real river data.
-
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
-
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency t...
-
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineer...
-
Training-Free Inference for High-Resolution Sinogram Completion
HRSino is a training-free adaptive diffusion inference approach for high-resolution sinogram completion that reduces peak memory by up to 30.81% and inference time by up to 17.58% while maintaining accuracy.
-
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.
-
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.
-
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
-
Moonwalk: Inverse-Forward Differentiation
Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop run...
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Ring Attention with Blockwise Transformers for Near-Infinite Context
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Longformer: The Long-Document Transformer
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
-
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
Generating Long Sequences with Sparse Transformers
Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
-
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning
ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.
-
Towards Understanding Self-Pretraining for Sequence Classification
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
-
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
-
Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting
Njord is a probabilistic GNN model using latent variables and adaptive K-means meshes that produces ensemble forecasts and outperforms deterministic ML baselines on global OceanBench and Baltic Sea domains.
-
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
-
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
-
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietar...
-
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
-
Quantum Dynamics via Score Matching on Bohmian Trajectories
Neural networks learn the score of the probability density on Bohmian trajectories to recover exact Schrödinger dynamics via self-consistent minimization for nodeless wave functions, demonstrated on double-well splitt...
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
Streaming Structured Inference with Flash-SemiCRF
Flash-SemiCRF enables exact semi-CRF inference on long sequences by evaluating edge potentials from compact prefix sums and streaming the forward-backward pass while preserving exact gradients.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
-
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
-
GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training
GeoPT pre-trains on over one million geometry samples augmented with synthetic dynamics to improve neural physics simulators on fluid and solid mechanics benchmarks while reducing labeled data needs by 20-60% and acce...
-
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without ac...
-
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
-
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
-
SpikingBrain: Spiking Brain-inspired Large Models
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
-
MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.
-
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.
-
GWT: Scalable Optimizer State Compression for Large Language Model Training
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.
-
Transolver: A Fast Transformer Solver for PDEs on General Geometries
Transolver learns intrinsic physical states from discretized meshes by adaptively splitting domains into flexible learnable slices and computing attention over physics-aware tokens, achieving state-of-the-art PDE solv...
-
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
torchtune: PyTorch native post-training library
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
-
Instant GPU Efficiency Visibility at Fleet Scale
OFU is a hardware-counter metric that approximates application MFU to within 2 percentage points after tile correction and shows r=0.78 correlation on 608 production jobs.
-
Replacement Learning: Training Neural Networks with Fewer Parameters
Replacement Learning replaces selected blocks in CNNs and ViTs with learnable parameter-fusion surrogates derived from adjacent layers to reduce full-depth backpropagation redundancy.
Appendix A. Search over Budget B
Alg. 3 allows us to generate an optimized memory plan given a single parameter B. This algorithm relies on approximate memory estimation for faste...