{"work":{"id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","openalex_id":null,"doi":null,"arxiv_id":"1604.06174","raw_key":null,"title":"Training Deep Nets with Sublinear Memory Cost","authors":null,"authors_text":"Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin","year":2016,"venue":"cs.LG","abstract":"We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.","external_url":"https://arxiv.org/abs/1604.06174","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:45:32.940919+00:00","pith_arxiv_id":"1604.06174","created_at":"2026-05-08T23:54:27.079146+00:00","updated_at":"2026-05-25T08:45:32.940919+00:00","title_quality_ok":true,"display_title":"Training Deep Nets with Sublinear Memory Cost","render_title":"Training Deep Nets with Sublinear Memory Cost"},"hub":{"state":{"work_id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":68,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2019-04-23T19:29:47+00:00","last_pith_cited_at":"2026-05-20T17:32:08+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-26T20:36:52.136391+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":12},{"context_role":"method","n":8},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":11},{"context_polarity":"use_method","n":8},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:49:50.276443+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":8},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":8},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":7},{"title":"Mixed Precision Training","work_id":"c525941b-ce20-4bcb-8509-a9968f1e89c3","shared_citers":7},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"BERT : Pre-training of deep bidirectional transformers for language understanding","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":5},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":5},{"title":"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding","work_id":"1bb6fb0c-482d-43cf-94a8-ed18f72a5563","shared_citers":5},{"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","shared_citers":5},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":5},{"title":"A simple method for commonsense reasoning","work_id":"a8423cfc-3f91-4307-9c05-02cfd6a0c714","shared_citers":4},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":4},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":4},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":4},{"title":"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour","work_id":"f3dc32a4-cf81-467b-8ff4-3b2f21d3bf1f","shared_citers":3},{"title":"arXiv preprint arXiv:1806.03377 , year=","work_id":"335ca03b-43f7-43d8-af32-3eaeb6735100","shared_citers":3},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":3},{"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","shared_citers":3}],"time_series":[{"n":3,"year":2019},{"n":2,"year":2020},{"n":4,"year":2022},{"n":6,"year":2023},{"n":2,"year":2024},{"n":1,"year":2025},{"n":21,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:49:54.010525+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:49:37.655418+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Training Deep Nets with Sublinear Memory Cost","claims":[{"claim_text":"We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph a","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Training Deep Nets with Sublinear Memory Cost because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:49:25.713958+00:00"}},"summary":{"title":"Training Deep Nets with Sublinear Memory Cost","claims":[{"claim_text":"We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph a","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Training Deep Nets with Sublinear Memory Cost because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":8},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":8},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":7},{"title":"Mixed Precision Training","work_id":"c525941b-ce20-4bcb-8509-a9968f1e89c3","shared_citers":7},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"BERT : Pre-training of deep bidirectional transformers for language understanding","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":5},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":5},{"title":"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding","work_id":"1bb6fb0c-482d-43cf-94a8-ed18f72a5563","shared_citers":5},{"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","shared_citers":5},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":5},{"title":"A simple method for commonsense reasoning","work_id":"a8423cfc-3f91-4307-9c05-02cfd6a0c714","shared_citers":4},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":4},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":4},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":4},{"title":"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour","work_id":"f3dc32a4-cf81-467b-8ff4-3b2f21d3bf1f","shared_citers":3},{"title":"arXiv preprint arXiv:1806.03377 , year=","work_id":"335ca03b-43f7-43d8-af32-3eaeb6735100","shared_citers":3},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":3},{"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","shared_citers":3}],"time_series":[{"n":3,"year":2019},{"n":2,"year":2020},{"n":4,"year":2022},{"n":6,"year":2023},{"n":2,"year":2024},{"n":1,"year":2025},{"n":21,"year":2026}],"dependency_candidates":[]},"authors":[]}}