Training Deep Nets with Sublinear Memory Cost

Bing Xu; Carlos Guestrin; Chiyuan Zhang; Tianqi Chen

arxiv: 1604.06174 · v2 · submitted 2016-04-21 · 💻 cs.LG

Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin

Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords deep neural network training, memory optimization, checkpointing, computation graph analysis, sublinear memory, residual networks, recurrent neural networks, GPU memory reduction

The pith

An algorithm trains an n-layer deep network using O(sqrt(n)) memory at the cost of one extra forward pass per mini-batch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to train deep neural networks with memory that grows only as the square root of the number of layers. It works by storing checkpoints at regular intervals and recomputing the missing activations during the backward pass. A sympathetic reader would care because many state-of-the-art models are limited by GPU memory, and this allows deeper and more complex models without additional hardware. The approach also uses computation graph analysis for automatic in-place operations and memory sharing. Experiments show large reductions, such as cutting the memory for training a 1000-layer residual network from 48G to 7G.

Core claim

We design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation.

What carries the argument

The checkpointing strategy that segments the computation graph into sqrt(n) intervals, storing activations only at boundaries and recomputing forwards inside each interval during backpropagation.
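The strategy can be illustrated on a toy chain of scalar layers. The sketch below is a minimal illustration, not the paper's implementation: the layer `tanh(w * x)`, the function names, and the choice of segment length `isqrt(n)` are all assumptions made for the example. It keeps one activation per segment during the forward pass, recomputes the activations of one segment at a time during the backward pass, and checks the result against ordinary backpropagation that stores everything.

```python
import math

def layer(x, w):
    """Toy layer (hypothetical): y = tanh(w * x)."""
    return math.tanh(w * x)

def layer_grads(x, w, gy):
    """Return (dL/dx, dL/dw) for y = tanh(w * x), given gy = dL/dy."""
    y = math.tanh(w * x)
    dz = gy * (1.0 - y * y)
    return dz * w, dz * x

def full_backprop(x0, ws):
    """Baseline: store every layer input, i.e. O(n) activations."""
    acts = [x0]
    for w in ws:
        acts.append(layer(acts[-1], w))
    g = 1.0  # the loss is taken to be the final activation
    gws = [0.0] * len(ws)
    for i in reversed(range(len(ws))):
        g, gws[i] = layer_grads(acts[i], ws[i], g)
    return gws

def checkpointed_backprop(x0, ws):
    """Store one activation per segment; recompute the rest on demand."""
    n = len(ws)
    k = max(1, math.isqrt(n))  # segment length, assumed to be ~sqrt(n)
    checkpoints = {}           # about n / k stored segment inputs
    x = x0
    for i, w in enumerate(ws):
        if i % k == 0:
            checkpoints[i] = x
        x = layer(x, w)
    g = 1.0
    gws = [0.0] * n
    for start in reversed(range(0, n, k)):  # last segment first
        end = min(start + k, n)
        acts = [checkpoints[start]]         # at most k live activations
        for i in range(start, end - 1):
            acts.append(layer(acts[-1], ws[i]))
        for i in reversed(range(start, end)):
            g, gws[i] = layer_grads(acts[i - start], ws[i], g)
    return gws

ws = [0.5 + 0.01 * i for i in range(100)]
reference = full_backprop(0.3, ws)
sublinear = checkpointed_backprop(0.3, ws)
assert all(abs(a - b) < 1e-12 for a, b in zip(reference, sublinear))
```

With 100 layers the checkpointed version holds 10 segment inputs plus at most 10 recomputed activations at any time, instead of 101 stored activations in the baseline.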

If this is right

A 1000-layer residual network trains with memory reduced from 48G to 7G and only 30 percent extra running time on ImageNet.
Complex recurrent neural networks become trainable on very long sequences with substantially lower memory.
State-of-the-art models no longer hit GPU memory limits as quickly, enabling exploration of deeper architectures.
An extreme variant reduces memory to O(log n) at the cost of O(n log n) extra forward computation.
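The O(log n) variant in the last point can be pictured as recursive splitting. The sketch below is one way to realize such a trade-off, assuming binary splitting of a chain of scalar functions; the paper's own recursion may be parameterized differently, and `recursive_backprop`, `make_chain` and the `stats` counters are hypothetical names. Each recursion level keeps a single midpoint activation, so the number of live activations grows with the recursion depth (about log2 n), while the forward work repeated across levels adds up to about (n/2) log2 n layer evaluations.

```python
import math

def recursive_backprop(fs, dfs, x, g, stats, depth=1):
    """Backprop through the chain fs given its input x and the gradient g
    of the loss with respect to the chain output. Returns dL/dx.
    Each live recursion frame holds one midpoint activation."""
    stats["peak"] = max(stats["peak"], depth)
    m = len(fs)
    if m == 1:
        return g * dfs[0](x)
    mid = m // 2
    xm = x
    for f in fs[:mid]:  # recompute forward up to the midpoint
        xm = f(xm)
        stats["forward"] += 1
    # Second half first (it needs the midpoint), then the first half.
    g = recursive_backprop(fs[mid:], dfs[mid:], xm, g, stats, depth + 1)
    return recursive_backprop(fs[:mid], dfs[:mid], x, g, stats, depth + 1)

def make_chain(n):
    """Hypothetical toy chain: n copies of f(x) = x + 0.1 * sin(x)."""
    fs = [lambda x: x + 0.1 * math.sin(x)] * n
    dfs = [lambda x: 1.0 + 0.1 * math.cos(x)] * n
    return fs, dfs

fs, dfs = make_chain(1024)
stats = {"forward": 0, "peak": 0}
grad = recursive_backprop(fs, dfs, 0.5, 1.0, stats)
print(stats)  # recursion depth and number of forward layer evaluations
```

For n = 1024 the recursion depth is 11 and the forward count is 5120, that is (n/2) log2 n, which is consistent with the O(log n) memory and O(n log n) forward cost quoted above.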

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could lower hardware barriers for training large models and make advanced deep learning more accessible on modest GPUs.
Adaptive checkpoint intervals based on per-layer compute cost might improve the compute-memory trade-off further.
The method pairs naturally with model parallelism to scale to even larger networks without changing the core algorithm.
Systems with high compute throughput relative to memory bandwidth would see the smallest effective overhead from the extra forward passes.

Load-bearing premise

The computation graph can be cleanly segmented into sqrt(n) intervals where recomputing forward passes inside each interval is both correct and cheaper than storing all intermediate activations.

What would settle it

Running the algorithm on a 1000-layer residual network and measuring whether peak memory usage scales as O(sqrt(n)), total runtime increases by about 30 percent, and the resulting gradients match those from full-storage training.

Original abstract

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Key takeaway: O(sqrt(n)) memory for deep net training via graph segmentation and recomputation, with solid experiments.

The key thing to know is that this paper gives a way to train an n-layer network using O(sqrt(n)) memory instead of O(n), by storing only sqrt(n) activations and recomputing the segments during back-propagation, at the cost of one extra forward pass per batch. The new part is the detailed analysis of the DNN computation graph to make this work automatically, including in-place operations and memory sharing. They show how to segment the chain into intervals and derive the bounds directly from that.

Its strength is that it delivers both the theory and real results: the 1000-layer ResNet's memory drops from 48G to 7G with 30% extra time on ImageNet, and they demonstrate it on complex RNNs too. The numbers line up with the predicted trade-off, and there's no circular reasoning in the bounds.

The main soft spot is the assumption that the graph segments cleanly so that recomputing inside each interval is efficient and correct. Their experiments confirm it for the tested cases, but for graphs with lots of branches or custom ops, it might require more work to get the optimal segmentation. That's minor given how well it performs on standard deep models.

This is useful for anyone training deep or recurrent nets where memory limits what they can try. It builds on existing checkpointing ideas but applies them systematically to modern DNNs with clear complexity results. I would cite this and bring it to a reading group. It deserves peer review because the central claim holds up with matching theory and practice. Recommendation: Yes, send it to referees.

Referee Report

0 major / 2 minor

Summary. The manuscript presents an algorithm to train deep neural networks with O(sqrt(n)) memory cost for an n-layer network, incurring only the cost of one extra forward pass per mini-batch. This is achieved through computation graph analysis, segmenting the network into intervals, storing boundary activations, and recomputing forward passes within segments during backpropagation. The approach is extended to O(log n) memory with O(n log n) extra computation, and validated on ImageNet with a 1000-layer ResNet (48G to 7G memory) and long-sequence RNNs.

Significance. If the claims hold, this is a significant contribution to deep learning training efficiency, allowing exploration of deeper models on memory-constrained hardware like GPUs. The systematic use of DAG properties for memory optimization, combined with empirical validation showing memory reduction with modest time overhead and correct gradients, provides a practical tool for advancing DL research. The parameter-free derivation from standard graph segmentation is a strength.

minor comments (2)

[Abstract] The O(sqrt(n)) claim would be clearer if it explicitly stated the segmentation assumption (clean intervals where recomputation is correct and cheaper than storing all activations) that underpins the bound.
[Experiments] The 30% extra time cost for the 1000-layer ResNet is reported, but a per-component breakdown (recomputation vs. original forward/backward) would make the compute-memory trade-off more transparent.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the assessment of its significance, and the recommendation to accept the manuscript. No major comments requiring response or revision were raised.

Point-by-point responses

Referee: No specific major comments were listed in the report.

Authors: We appreciate the referee's recognition that the algorithm provides a systematic, parameter-free approach to memory reduction via graph segmentation and recomputation, with empirical validation on large models. The description of the O(sqrt(n)) memory bound, the O(log n) extension, and the ImageNet/ResNet and RNN experiments matches our claims exactly.

Revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The O(sqrt(n)) memory bound is obtained by partitioning the n-layer computation DAG into sqrt(n) segments, retaining only the sqrt(n) boundary activations, and performing one recomputation of each segment during back-propagation; the total extra work equals one forward pass by direct operation counting on the graph. This counting argument relies only on standard properties of feed-forward and recurrent DAGs plus the in-place/memory-sharing optimizations described in the paper; no parameters are fitted to data, no result is defined in terms of itself, and no load-bearing step reduces to a self-citation. The reported ImageNet and RNN experiments serve as empirical confirmation rather than definitional inputs.
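The counting argument can be written out explicitly. The derivation below is a sketch under simplifying assumptions that are not stated in the source: all activations have the same size, memory M is measured in units of one activation, and the segment length k is treated as a continuous variable.

```latex
\begin{align*}
M(k) &= \underbrace{\frac{n}{k}}_{\text{stored checkpoints}}
      + \underbrace{k}_{\text{one recomputed segment}}, \\
\frac{dM}{dk} &= -\frac{n}{k^{2}} + 1 = 0
      \;\Longrightarrow\; k = \sqrt{n},
      \qquad M(\sqrt{n}) = 2\sqrt{n} = O(\sqrt{n}), \\
C_{\text{extra}} &\le \sum_{j=1}^{n/k} k = n
      \quad \text{(each layer is recomputed at most once).}
\end{align*}
```

For n = 1000 this gives k of about 32 and roughly 63 live activations instead of 1000, with extra work bounded by one forward pass.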

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters, axioms beyond standard DAG properties, or invented entities; the contribution is purely algorithmic.

axioms (1)

[standard math] The forward computation graph is a directed acyclic graph whose nodes correspond to layer activations.
Invoked when analyzing memory storage and recomputation segments.
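
Treating the computation as a DAG is also what allows memory to be planned automatically. The sketch below is an illustrative assumption rather than the paper's planner: `plan_buffers` is a hypothetical helper that walks the nodes in topological order, counts how many consumers each activation still has, and hands a buffer back to a free pool once the last consumer has run. It covers buffer sharing only. It does not model in-place operations or the extra consumers that the backward pass introduces during training.

```python
def plan_buffers(nodes):
    """nodes: list of (name, [input names]) in topological order.
    Returns {name: buffer_id}. A buffer is reused once every consumer
    of the activation it currently holds has been executed."""
    remaining = {name: 0 for name, _ in nodes}
    for _, inputs in nodes:
        for inp in inputs:
            remaining[inp] += 1  # count the consumers of each activation

    free, assignment, next_id = [], {}, 0
    for name, inputs in nodes:
        # Allocate the output before releasing inputs, so a node never
        # writes into a buffer it is still reading from.
        if free:
            assignment[name] = free.pop()
        else:
            assignment[name] = next_id
            next_id += 1
        for inp in inputs:
            remaining[inp] -= 1
            if remaining[inp] == 0:
                free.append(assignment[inp])
    return assignment

chain = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
print(plan_buffers(chain))  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

In this forward-only toy case a four-node chain needs two buffers instead of four, because each activation is dead as soon as its single consumer has run.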

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion
cs.CV 2026-05 unverdicted novelty 8.0

MIRAGE discovers semantic attacks on online HD map construction via conditional diffusion, enabling boundary removal and injection that degrade AV performance while passing as realistic environmental changes.
Efficient Training on Multiple Consumer GPUs with RoundPipe
cs.DC 2026-04 conditional novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
cs.LG 2026-05 unverdicted novelty 7.0

PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
Efficient and provably convergent end-to-end training of deep neural networks with linear constraints
math.OC 2026-05 unverdicted novelty 7.0

An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
cs.LG 2026-05 unverdicted novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...
Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Data-guided finite-volume PINNs for 2D shallow water equations avoid trivial low-momentum collapse via sparse measurements, achieving up to 22x error reduction on benchmarks and accurate surrogates on real river data.
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
cs.CV 2026-05 unverdicted novelty 7.0

A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
cs.DC 2026-05 conditional novelty 7.0

ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency t...
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
cs.SE 2026-04 unverdicted novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
cs.LG 2026-04 unverdicted novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
cs.LG 2026-04 unverdicted novelty 7.0

STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineer...
Training-Free Inference for High-Resolution Sinogram Completion
cs.CV 2025-06 unverdicted novelty 7.0

HRSino is a training-free adaptive diffusion inference approach for high-resolution sinogram completion that reduces peak memory by up to 30.81% and inference time by up to 17.58% while maintaining accuracy.
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
cs.CL 2024-12 unverdicted novelty 7.0

GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
cs.LG 2024-03 conditional novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
cs.LG 2024-02 unverdicted novelty 7.0

HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
Moonwalk: Inverse-Forward Differentiation
cs.LG 2024-02 unverdicted novelty 7.0

Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop run...
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL 2024-02 unverdicted novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
Ring Attention with Blockwise Transformers for Near-Infinite Context
cs.CL 2023-10 unverdicted novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
Efficient Memory Management for Large Language Model Serving with PagedAttention
cs.LG 2023-09 conditional novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
cs.LG 2022-05 accept novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Longformer: The Long-Document Transformer
cs.CL 2020-04 accept novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
cs.LG 2019-10 accept novelty 7.0

ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
Generating Long Sequences with Sparse Transformers
cs.LG 2019-04 unverdicted novelty 7.0

Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning
cs.LG 2026-05 conditional novelty 6.0

ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.
Towards Understanding Self-Pretraining for Sequence Classification
cs.LG 2026-05 unverdicted novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting
cs.LG 2026-05 unverdicted novelty 6.0

Njord is a probabilistic GNN model using latent variables and adaptive K-means meshes that produces ensemble forecasts and outperforms deterministic ML baselines on global OceanBench and Baltic Sea domains.
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces
cs.LG 2026-05 unverdicted novelty 6.0

LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
cs.LG 2026-05 unverdicted novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL 2026-05 unverdicted novelty 6.0

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
cs.CV 2026-04 conditional novelty 6.0

SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietar...
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
cs.CV 2026-04 unverdicted novelty 6.0

SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
Quantum Dynamics via Score Matching on Bohmian Trajectories
quant-ph 2026-04 unverdicted novelty 6.0

Neural networks learn the score of the probability density on Bohmian trajectories to recover exact Schrödinger dynamics via self-consistent minimization for nodeless wave functions, demonstrated on double-well splitt...
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
cs.SE 2026-04 unverdicted novelty 6.0

Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Streaming Structured Inference with Flash-SemiCRF
cs.LG 2026-04 unverdicted novelty 6.0

Flash-SemiCRF enables exact semi-CRF inference on long sequences by evaluating edge potentials from compact prefix sums and streaming the forward-backward pass while preserving exact gradients.
Continuous Adversarial Flow Models
cs.LG 2026-04 unverdicted novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
cs.CV 2026-04 unverdicted novelty 6.0

MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
cs.LG 2026-04 unverdicted novelty 6.0

MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training
cs.LG 2026-02 unverdicted novelty 6.0

GeoPT pre-trains on over one million geometry samples augmented with synthetic dynamics to improve neural physics simulators on fluid and solid mechanics benchmarks while reducing labeled data needs by 20-60% and acce...
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
cs.CL 2025-10 conditional novelty 6.0

MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without ac...
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
cs.LG 2025-09 unverdicted novelty 6.0

CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
cs.IR 2025-09 unverdicted novelty 6.0

MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
SpikingBrain: Spiking Brain-inspired Large Models
cs.LG 2025-09 unverdicted novelty 6.0

SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
cs.LG 2025-06 conditional novelty 6.0

MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
cs.CV 2025-01 conditional novelty 6.0

Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.
GWT: Scalable Optimizer State Compression for Large Language Model Training
cs.LG 2025-01 unverdicted novelty 6.0

GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.
Transolver: A Fast Transformer Solver for PDEs on General Geometries
cs.LG 2024-02 conditional novelty 6.0

Transolver learns intrinsic physical states from discretized meshes by adaptively splitting domains into flexible learnable slices and computing attention over physics-aware tokens, achieving state-of-the-art PDE solv...
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
cs.CV 2023-09 conditional novelty 6.0

DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
cs.CV 2022-03 conditional novelty 6.0

DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
Linformer: Self-Attention with Linear Complexity
cs.LG 2020-06 conditional novelty 6.0

Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
torchtune: PyTorch native post-training library
cs.LG 2026-05 unverdicted novelty 5.0

torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
Instant GPU Efficiency Visibility at Fleet Scale
cs.DC 2026-05 unverdicted novelty 5.0

OFU is a hardware-counter metric that approximates application MFU to within 2 percentage points after tile correction and shows r=0.78 correlation on 608 production jobs.
Replacement Learning: Training Neural Networks with Fewer Parameters
cs.CV 2026-05 unverdicted novelty 5.0

Replacement Learning replaces selected blocks in CNNs and ViTs with learnable parameter-fusion surrogates derived from adjacent layers to reduce full-depth backpropagation redundancy.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 68 Pith papers · 1 internal anchor

[1]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murra...

work page 2015
[2]

Seltzer, Malcolm Slaney, Andreas Stolcke, Yongqiang Wang, Huaming Wang, Kaisheng Yao, Dong Yu, Yu Zhang, and Geoffrey Zweig

Amit Agarwal, Eldar Akchurin, Chris Basoglu, Guoguo Chen, Scott Cyphers, Jasha Droppo, Adam Eversole, Brian Guenter, Mark Hillebrand, Ryan Hoens, Xuedong Huang, Zhiheng Huang, Vladimir Ivanov, Alexey Kamenev, Philipp Kranen, Oleksii Kuchaiev, Wolfgang Manousek, Avner May, Bhaskar Mitra, Olivier Nano, Gaizka Navarro, Alexey Orlov, Marko Padmilac, Hari Part...

work page 2014
[3]

Aho, Ravi Sethi, and Jeffrey D

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986

work page 1986
[4]

Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012

work page 2012
[5]

Theano: a CPU and GPU math expression compiler

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation

work page 2010
[6]

MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems (LearningSys'15), 2015

work page 2015
[7]

Large scale distributed deep networks

Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012

[8]

Deep learning

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016

[9]

Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation

Andreas Griewank and Andrea Walther. Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Trans. Math. Softw., 26(1):19–45, March 2000

[10]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015

[11]

Identity Mappings in Deep Residual Networks

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016

[12]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.

[13]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), 2015

[14]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012

[15]

Gradient-based learning applied to document recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001

[16]

Virtualizing deep neural networks for memory-efficient neural network design

Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. Virtualizing deep neural networks for memory-efficient neural network design. arXiv preprint arXiv:1602.08124, 2016

[17]

Long short-term memory recurrent neural network architectures for large scale acoustic modeling

Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 338–342, 2014

[18]

Training very deep networks

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. arXiv preprint arXiv:1507.06228, 2015

[19]

Highway long short-term memory rnns for distant speech recognition

Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory rnns for distant speech recognition. arXiv preprint arXiv:1510.08983, 2015.

A Search over Budget B

Alg. 3 allows us to generate an optimized memory plan given a single parameter B. This algorithm relies on approximate memory estimation for faste...

