Training deep nets with sublinear memory cost, 2016
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
roles: other (1)
polarities: unclear (1)
citing papers explorer
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States
  TTT layers treat the hidden state as a trainable model that is updated at test time, allowing linear-complexity sequence models to keep reducing perplexity as context length grows, unlike Mamba.
- Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
  Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.
- BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"
  BASIS uses balanced hashing and invariant scalars to sketch activations, cutting memory to O(L*R*N) while matching exact backprop performance on GPT training at R=32.