Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Pith reviewed 2026-05-15 05:15 UTC · model grok-4.3
The pith
RNN-style layers can match Transformers' long-context gains by updating a learnable hidden-state model with self-supervised gradient steps at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTT layers instantiate the hidden state as a trainable model and replace the usual recurrence with a step of self-supervised learning performed on the test sequence. In the two concrete instantiations examined, TTT-Linear uses a linear model and TTT-MLP a two-layer MLP; both keep lowering perplexity as context grows, while a strong Mamba baseline plateaus after 16k tokens. The evaluation covers models from 125M to 1.3B parameters, with direct comparisons against strong Transformer and Mamba baselines.
What carries the argument
The TTT layer, whose hidden state is itself a small model updated by one or more gradient steps of self-supervised learning on the current test sequence.
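The mechanism this hinges on fits in a few lines. The following is a hedged illustration, not the authors' implementation: the paper learns corruption and target projections for its self-supervised task, which are omitted here, and `ttt_linear_step`, the plain reconstruction loss, and the learning rate are illustrative choices.

```python
import numpy as np

def ttt_linear_step(W, x, lr=0.05):
    """One TTT-Linear-style recurrence step (illustrative sketch).

    The hidden state W is itself a linear model. The self-supervised
    task here is plain reconstruction, l(W; x) = 0.5 * ||W x - x||^2;
    the paper's learned corruption/target views are omitted.
    """
    err = W @ x - x           # residual of the reconstruction
    grad = np.outer(err, x)   # dl/dW for the current token
    W = W - lr * grad         # update rule = one gradient step
    z = W @ x                 # output is computed with the updated state
    return W, z

# The state keeps "training" as test tokens arrive, so the
# self-supervised loss on later tokens tends to fall.
rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))
losses = []
for x in rng.standard_normal((64, d)):
    losses.append(0.5 * np.sum((W @ x - x) ** 2))
    W, _ = ttt_linear_step(W, x)
```

On this toy stream the average loss over the last tokens ends up well below the first ones, which is the sense in which the hidden state "learns" from the test sequence.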
If this is right
- Linear-complexity layers can continue to benefit from additional context beyond the point where fixed-state RNNs saturate.
- The same architecture family can be scaled from 125M to over a billion parameters while preserving the long-context scaling behavior.
- Memory and compute trade-offs shift from attention's quadratic growth to the cost of storing and updating the internal model parameters.
- Future layer designs can focus on improving the I/O efficiency of the gradient steps without changing the core recurrence.
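The cost trade-off in the bullets above can be made concrete with a back-of-the-envelope FLOP count. The constants below are assumptions (attention ≈ 2T²d for scores plus value mixing; a d×d linear hidden state ≈ 4Td² for forward pass, gradient, update, and output), so only the resulting T/(2d) ratio is meaningful:

```python
def attention_flops(T, d):
    """Rough self-attention cost: score matrix + value mix ~ 2 * T^2 * d."""
    return 2 * T * T * d

def ttt_linear_flops(T, d):
    """Rough TTT-Linear cost: forward, gradient outer product, update,
    and output are each ~d^2 per token, so ~4 * T * d^2 overall."""
    return 4 * T * d * d

d = 1024  # assumed model width
for T in (2_048, 16_384, 131_072):
    ratio = attention_flops(T, d) / ttt_linear_flops(T, d)
    print(f"T={T:>7}: attention / TTT-Linear ~ {ratio:.0f}x")
```

Under these assumptions the linear layer pulls ahead once T exceeds roughly 2d, which is why the beyond-16k-token regime is where the comparison matters.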
Where Pith is reading between the lines
- Dynamic adaptation of the hidden state could reduce reliance on extremely long fixed context windows if the model learns useful patterns from recent tokens alone.
- The same mechanism might be applied to online settings where new data arrives continuously and the model must improve without a separate training phase.
- If the internal model can be made lighter, TTT layers could serve as drop-in replacements for attention in resource-constrained inference environments.
Load-bearing premise
Gradient-based self-supervised updates performed on the hidden-state model during inference stay stable, cheap enough to run, and do not overfit or degrade the output.
What would settle it
A controlled run in which TTT-Linear or TTT-MLP stops improving perplexity after 16k tokens or begins to produce unstable outputs when the test-time updates are enabled.
read the original abstract
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Test-Time Training (TTT) layers as a framework for sequence modeling with linear complexity but expressive hidden states. The hidden state is instantiated as a learnable model (linear regressor or 2-layer MLP) whose parameters are updated via self-supervised gradient steps on the input sequence at test time. Two variants, TTT-Linear and TTT-MLP, are evaluated at 125M–1.3B parameter scales against a strong Transformer baseline and Mamba; the key empirical result is that TTT models continue to reduce perplexity as context grows beyond 16k tokens while Mamba plateaus.
Significance. If the central empirical claim holds, the work supplies a concrete route to linear-complexity models whose hidden states adapt via test-time learning, yielding continued gains on long contexts where standard RNNs saturate. The scaling experiments to 1.3B parameters and direct head-to-head comparisons with Mamba and Transformer constitute reproducible empirical evidence that strengthens the case for test-time adaptation as a viable direction.
major comments (2)
- [§4 (Experiments)] The claim that TTT-Linear/MLP continue reducing perplexity beyond 16k tokens while Mamba plateaus depends on the hidden-state model receiving stable, beneficial self-supervised gradient updates at inference. The section reports final perplexity numbers but provides no analysis of update stability (gradient norms, per-step loss trajectories, or divergence checks) or of sensitivity to the number of gradient steps and the learning-rate schedule used during test-time training. This is load-bearing for the scaling advantage.
- [§3 (Method)] The update rule for the hidden-state parameters (linear or MLP) is defined as a self-supervised step, yet the manuscript does not specify the exact optimizer, the step count per token or segment, or the regularization used at test time. Without these details it is impossible to assess whether the reported linear-complexity advantage remains tractable and non-overfitting at the 1.3B scale.
minor comments (2)
- [Abstract, §4] The phrase 'memory I/O issues for TTT-MLP' is stated without any quantitative breakdown (e.g., peak memory vs. context length, or wall-clock overhead relative to Mamba). Adding a short table or plot would clarify the practical limitation.
- [§3 (Method)] Notation: The symbols for the hidden-state model parameters and the self-supervised loss are introduced without an explicit table of definitions, making cross-references to the update equations harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of TTT layers for long-context scaling. We address each major comment below and will incorporate the requested details and analyses into the revised manuscript.
read point-by-point responses
-
Referee: [§4 (Experiments)] The claim that TTT-Linear/MLP continue reducing perplexity with >16k tokens while Mamba plateaus depends on the hidden-state model receiving stable, beneficial self-supervised gradient updates at inference. The section reports final perplexity numbers but provides no analysis of update stability (gradient norms, per-step loss trajectories, or divergence checks) or sensitivity to the number of gradient steps and learning-rate schedule used during test-time training. This is load-bearing for the scaling advantage.
Authors: We agree that stability analysis is necessary to support the central empirical claim. In the revised version we will add to §4 new figures and text reporting (i) gradient-norm trajectories during test-time updates on long sequences, (ii) per-step self-supervised loss curves on held-out segments, (iii) explicit checks for divergence or instability, and (iv) ablation tables showing sensitivity of final perplexity to the number of gradient steps and the learning-rate schedule used at test time. These additions will directly substantiate that the observed scaling advantage arises from stable, beneficial updates. revision: yes
-
Referee: [§3 (Method)] The update rule for the hidden-state parameters (linear or MLP) is defined as a self-supervised step, yet the manuscript does not specify the exact optimizer, step count per token/segment, or regularization used at test time. Without these details it is impossible to assess whether the reported linear-complexity advantage remains tractable and non-overfitting at 1.3B scale.
Authors: We acknowledge the omission of precise test-time hyperparameters. The revised §3 will explicitly state the optimizer (Adam with β1=0.9, β2=0.999), the exact number of gradient steps performed per token or per segment, the learning-rate value and any decay schedule, and the regularization applied (weight decay of 0.01 together with gradient clipping at norm 1.0). These details will be provided for both TTT-Linear and TTT-MLP so that readers can verify tractability and reproducibility at the 1.3B scale. revision: yes
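The configuration the (simulated) rebuttal names — Adam with β1 = 0.9, β2 = 0.999, weight decay 0.01, and gradient clipping at norm 1.0 — could be sketched as a test-time update like the one below. This is a hedged reconstruction, not code from the paper; the reconstruction loss, shapes, and the choice to fold weight decay into the gradient (plain L2, not AdamW-style decoupling) are all illustrative:

```python
import numpy as np

def adam_ttt_step(W, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
                  eps=1e-8, wd=0.01, clip=1.0):
    """One clipped Adam step on the hidden-state parameters at test time.

    Hyperparameters mirror those named in the rebuttal; everything else
    (loss, decay style) is an assumption for illustration.
    """
    gnorm = np.linalg.norm(grad)
    if gnorm > clip:                       # gradient clipping at norm 1.0
        grad = grad * (clip / gnorm)
    grad = grad + wd * W                   # weight decay 0.01 (L2 into grad)
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    mhat = m / (1 - b1 ** t)               # bias correction
    vhat = v / (1 - b2 ** t)
    W = W - lr * mhat / (np.sqrt(vhat) + eps)
    return W, m, v

# Toy test-time loop: adapt a linear hidden state to reconstruct tokens.
rng = np.random.default_rng(1)
d = 8
W = np.zeros((d, d))
m = np.zeros_like(W)
v = np.zeros_like(W)
for t, x in enumerate(rng.standard_normal((50, d)), start=1):
    grad = np.outer(W @ x - x, x)   # dl/dW for l = 0.5 * ||W x - x||^2
    W, m, v = adam_ttt_step(W, m, v, grad, t)
```

The clipping plus Adam's per-coordinate normalization bounds how far any single token can move the state, which is exactly the stability property the referee asks the authors to demonstrate.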
Circularity Check
No significant circularity; architectural proposal with direct empirical validation
full rationale
The paper defines TTT layers by making the hidden state itself a learnable model (linear or two-layer MLP) whose parameters are updated via a self-supervised gradient step on each test token or segment. This is an explicit architectural choice, not a mathematical derivation that reduces to prior equations or fitted inputs. No load-bearing self-citations, uniqueness theorems from the same authors, or ansatzes smuggled in via prior work appear in the core construction. The central scaling claim (TTT continues reducing perplexity beyond 16k tokens while Mamba plateaus) rests on direct experimental comparisons at the 125M–1.3B scale rather than on any reduction of outputs to inputs by construction. The claims are therefore grounded in external benchmarks rather than circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: self-supervised gradient updates on a small model serving as the hidden state improve expressiveness without instability at test time.
invented entities (1)
- TTT layer (no independent evidence)
Forward citations
Cited by 21 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
A Single-Layer Model Can Do Language Modeling
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
-
Linearizing Vision Transformer with Test-Time Training
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Cortico-cerebellar modularity as an architectural inductive bias for efficient temporal learning
CB-RNNs with a cerebellar feedforward module learn temporal tasks faster than matched RNNs, with the module driving efficiency even after freezing the recurrent core as a fixed reservoir.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
Measuring Accuracy and Energy-to-Solution of Quantum Fine-Tuning of Foundational AI Models
Trapped-ion quantum fine-tuning of AI models shows linear energy scaling and 24% better classification error than classical logistic regression or SVM baselines, with a projected energy break-even at 34 qubits.
-
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Learning to learn by gradient descent by gradient descent
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016
work page 2016
-
[3]
You just found out your book was used to train ai
Authors Guild. You just found out your book was used to train ai. now what?, 2023. Accessed: 2024-06-24
work page 2023
-
[4]
xlstm: Ex- tended long short-term memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Ex- tended long short-term memory. arXiv preprint arXiv:2405.04517, 2024
-
[5]
Learning a synaptic learning rule
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Citeseer, 1990
work page 1990
-
[6]
The nadaraya-watson kernel regression function estimator
Hermanus Josephus Bierens. The nadaraya-watson kernel regression function estimator. (Serie Research Memoranda; No. 1988-58). Faculty of Economics and Business Administration, Vrije Universiteit Amsterdam., 1988
work page 1988
-
[7]
Pattern recognition and machine learning , volume 4
Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning , volume 4. Springer, 2006
work page 2006
-
[8]
Gpt-neox-20b: An open-source autoregressive language model
Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022
-
[9]
Local learning algorithms.Neural computation, 4(6):888–900, 1992
Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992
work page 1992
-
[10]
Variable kernel estimates of multivariate densities
Leo Breiman, William Meisel, and Edward Purcell. Variable kernel estimates of multivariate densities. Technometrics, 19(2):135–144, 1977
work page 1977
-
[11]
Weighted nadaraya–watson regression estimation
Zongwu Cai. Weighted nadaraya–watson regression estimation. Statistics & probability letters, 51(3):307–318, 2001
work page 2001
-
[12]
Training deep nets with sublinear memory cost, 2016
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016
work page 2016
-
[13]
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[14]
A tutorial on kernel density estimation and recent advances
Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017
work page 2017
-
[15]
Meta-learning fast weight language models
Kevin Clark, Kelvin Guu, Ming-Wei Chang, Panupong Pasupat, Geoffrey Hinton, and Moham- mad Norouzi. Meta-learning fast weight language models. arXiv preprint arXiv:2212.02475, 2022
-
[16]
Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou, and Thorsten Joachims. Large scale transductive svms. Journal of Machine Learning Research, 7(8), 2006
work page 2006
-
[17]
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024. 20
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for e fficient language models. arXiv preprint arXiv:2402.19427, 2024
work page internal anchor Pith review arXiv 2024
-
[19]
In the long (context) run, 2023
Harm de Vries. In the long (context) run, 2023. Accessed: 2024-06-24
work page 2023
-
[20]
Dynamic connections in neural networks.Biological cybernetics, 46(1):27–39, 1982
Jerome A Feldman. Dynamic connections in neural networks.Biological cybernetics, 46(1):27–39, 1982
work page 1982
-
[21]
Model-agnostic meta-learning for fast adapta- tion of deep networks
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta- tion of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017
work page 2017
-
[22]
A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998
work page 1998
-
[23]
Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 2022
work page 2022
-
[24]
The pile: An 800gb dataset of diverse text for language modeling, 2020
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020
work page 2020
-
[25]
EasyLM: A Simple And Scalable Training Framework for Large Language Models
Xinyang Geng. EasyLM: A Simple And Scalable Training Framework for Large Language Models. https://github.com/young-geng/EasyLM, mar 2023. https://github.com/ young-geng/EasyLM
work page 2023
-
[26]
Unlocking state-tracking in linear rnns through negative eigenvalues
Riccardo Grazzi, Julien Siems, Arber Zela, Jörg KH Franke, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear rnns through negative eigenvalues. International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[27]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Self-supervised policy adaptation during deployment
Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309, 2020
-
[29]
Test-time training on nearest neighbors for large language models
Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. arXiv preprint arXiv:2305.18466, 2023
-
[30]
Horace He. Strangely, matrix multiplications on gpus run faster when given "predictable" data! [short], 2024. Accessed: 2024-06-30
work page 2024
-
[31]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[32]
Using fast weights to deblur old memories
Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. InProceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186, 1987
work page 1987
-
[33]
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997
work page 1997
-
[34]
Rae, Oriol Vinyals, and Laurent Sifre
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...
work page 2022
-
[35]
Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention. In International Conference on Machine Learning, pages 9639–9659. PMLR, 2022
work page 2022
-
[36]
Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. Practical computational power of linear transformers and their recurrent and self-referential extensions. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
work page 2023
-
[37]
Neural di fferential equations for learning to program neural nets through continuous learning rules
Kazuki Irie, Francesco Faccio, and Jürgen Schmidhuber. Neural di fferential equations for learning to program neural nets through continuous learning rules. Advances in Neural Information Processing Systems, 35:38614–38628, 2022
work page 2022
-
[38]
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. Going beyond linear transformers with recurrent fast weight programmers.Advances in Neural Information Processing Systems, 34:7703–7717, 2021
work page 2021
-
[39]
A modern self-referential weight matrix that learns to modify itself
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. A modern self-referential weight matrix that learns to modify itself. In International Conference on Machine Learning , pages 9660–9677. PMLR, 2022
work page 2022
-
[40]
Images as weight matrices: Sequential image generation through synaptic learning rules
Kazuki Irie and Jürgen Schmidhuber. Images as weight matrices: Sequential image generation through synaptic learning rules. International Conference on Learning Representations (ICLR), 2022
work page 2022
-
[41]
Online domain adaptation of a pre-trained cascade of classifiers
Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011
work page 2011
-
[42]
Learning to classify text using support vector machines, volume 668
Thorsten Joachims. Learning to classify text using support vector machines, volume 668. Springer Science & Business Media, 2002
work page 2002
-
[43]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[44]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020
work page 2020
-
[45]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[46]
Meta learning backpropagation and improving it
Louis Kirsch and Jürgen Schmidhuber. Meta learning backpropagation and improving it. Advances in Neural Information Processing Systems, 34:14122–14134, 2021
work page 2021
-
[47]
Dynamic evaluation of neural sequence models
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pages 2766–2775. PMLR, 2018
work page 2018
-
[48]
Dynamic Evaluation of Transformer Language Models
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[49]
E fficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. E fficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023
work page 2023
-
[50]
Building machines that learn and think like people
Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40:e253, 2017. 22
work page 2017
-
[51]
Building high-level features using large scale unsupervised learning
Quoc V Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8595–8598. IEEE, 2013
work page 2013
-
[52]
World model on million-length video and language with blockwise ringattention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268, 2024
-
[53]
Consistent video depth estimation
Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4):71–1, 2020
work page 2020
-
[54]
Gradient-based hyperparameter optimization through reversible learning
Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113–2122. PMLR, 2015
work page 2015
-
[55]
Meta-Learning Update Rules for Unsupervised Representation Learning
Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[56]
Online model distillation for efficient video inference
Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference. arXiv preprint arXiv:1812.02699, 2018
-
[57]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P . Pret- tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011
work page 2011
-
[58]
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence
Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024
-
[60]
The devil in linear transformer
Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022
-
[61]
The perceptron: a probabilistic model for information storage and organiza- tion in the brain
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organiza- tion in the brain. Psychological review, 65(6):386, 1958
work page 1958
-
[62]
Linear transformers are secretly fast weight programmers
Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021
work page 2021
-
[63]
Learning associative inference using fast weight memory
Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning associative inference using fast weight memory. arXiv preprint arXiv:2011.07831, 2020
-
[64]
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-
Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987
-
[65]
Learning to control fast-weight memories: An alternative to dynamic recurrent networks
Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992
-
[66]
GLU variants improve transformer
Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
-
[67]
Normformer: Improved transformer pretraining with extra normalization
Sam Shleifer, Jason Weston, and Myle Ott. Normformer: Improved transformer pretraining with extra normalization. arXiv preprint arXiv:2110.09456, 2021
-
[68]
-
[69]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2023
-
[70]
Learning to (learn at test time)
Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time). arXiv preprint arXiv:2310.13807, 2023
-
[71]
Online learning of unknown dynamics for model-based controllers in legged locomotion
Yu Sun, Wyatt L Ubellacker, Wen-Loong Ma, Xiang Zhang, Changhao Wang, Noel V Csomay-Shanklin, Masayoshi Tomizuka, Koushil Sreenath, and Aaron D Ames. Online learning of unknown dynamics for model-based controllers in legged locomotion. IEEE Robotics and Automation Letters, 6(4):8442–8449, 2021
-
[72]
Test-time training with self-supervision for generalization under distribution shifts
Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020
-
[73]
Learning to learn: Introduction and overview
Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998
-
[74]
Using fast weights to improve persistent contrastive divergence
Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th annual international conference on machine learning, pages 1033–1040, 2009
-
[75]
Llama 2: Open foundation and fine-tuned chat models, 2023
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
-
[76]
The nature of statistical learning theory
Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013
-
[77]
Extracting and composing robust features with denoising autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008
-
[78]
The correlation theory of brain function
Christoph Von Der Malsburg. The correlation theory of brain function. In Models of neural networks: Temporal aspects of coding and information processing in biological systems, pages 95–119. Springer, 1994
-
[79]
Test-time training on video streams
Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A Efros, and Xiaolong Wang. Test-time training on video streams. arXiv preprint arXiv:2307.05014, 2023
-
[80]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019