pith. sign in

hub Mixed citations

Training Deep Nets with Sublinear Memory Cost

Mixed citation behavior. Most common role is background (52%).

68 Pith papers citing it
Background 52% of classified citations
abstract

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

hub tools

citation-role summary

background 12 method 8 dataset 1

citation-polarity summary

claims ledger

  • abstract We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph a

co-cited works

clear filters

representative citing papers

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations

cs.DC · 2026-05-07 · conditional · novelty 7.0

ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency than finite differences on models with up to 1.9 million latent variables.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

Moonwalk: Inverse-Forward Differentiation

cs.LG · 2024-02-22 · unverdicted · novelty 7.0

Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop runtime at over twice the depth.

QLoRA: Efficient Finetuning of Quantized LLMs

cs.LG · 2023-05-23 · conditional · novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Longformer: The Long-Document Transformer

cs.CL · 2020-04-10 · accept · novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

citing papers explorer

Showing 16 of 16 citing papers after filters.

  • Efficient Training on Multiple Consumer GPUs with RoundPipe cs.DC · 2026-04-29 · conditional · none · ref 4 · internal anchor

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

  • ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations cs.DC · 2026-05-07 · conditional · none · ref 30 · internal anchor

    ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency than finite differences on models with up to 1.9 million latent variables.

  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection cs.LG · 2024-03-06 · conditional · none · ref 5 · internal anchor

    GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

  • Efficient Memory Management for Large Language Model Serving with PagedAttention cs.LG · 2023-09-12 · conditional · none · ref 7 · internal anchor

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  • QLoRA: Efficient Finetuning of Quantized LLMs cs.LG · 2023-05-23 · conditional · none · ref 9 · internal anchor

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale cs.LG · 2022-08-15 · conditional · none · ref 75 · internal anchor

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  • ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning cs.LG · 2026-05-20 · conditional · none · ref 6 · internal anchor

    ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.

  • SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring cs.CV · 2026-04-28 · conditional · none · ref 7 · 2 links · internal anchor

    SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.

  • MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training cs.CL · 2025-10-21 · conditional · none · ref 34 · internal anchor

    MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.

  • MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation cs.LG · 2025-06-02 · conditional · none · ref 3 · internal anchor

    MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.

  • Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps cs.CV · 2025-01-16 · conditional · none · ref 9 · internal anchor

    Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.

  • Transolver: A Fast Transformer Solver for PDEs on General Geometries cs.LG · 2024-02-04 · conditional · none · ref 3 · internal anchor

    Transolver learns intrinsic physical states from discretized meshes by adaptively splitting domains into flexible learnable slices and computing attention over physics-aware tokens, achieving state-of-the-art PDE solving on general geometries.

  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards cs.CV · 2023-09-29 · conditional · none · ref 4 · internal anchor

    DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.

  • BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 19 · internal anchor

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  • DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection cs.CV · 2022-03-07 · conditional · none · ref 6 · internal anchor

    DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

  • Linformer: Self-Attention with Linear Complexity cs.LG · 2020-06-08 · conditional · none · ref 3 · internal anchor

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.