Training deep nets with sublinear memory cost, 2016

Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin · 2016

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model that is updated at test time, allowing linear-complexity sequence models to keep reducing perplexity as context length grows, unlike Mamba.

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

cs.LG · 2026-03-05 · unverdicted · novelty 6.0

BASIS uses balanced hashing and invariant scalars to sketch activations, cutting memory to O(L*R*N) while matching exact backprop performance on GPT training at R=32.

citing papers explorer

Showing 3 of 3 citing papers.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States · cs.LG · 2024-07-05 · conditional · none · ref 12
Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers · cs.LG · 2026-05-07 · unverdicted · none · ref 43
BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation" · cs.LG · 2026-03-05 · unverdicted · none · ref 1
