Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Pith reviewed 2026-05-15 05:15 UTC · model grok-4.3
The pith
RNN-style layers can match Transformers' long-context gains by updating a learnable hidden-state model with self-supervised gradient steps at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTT layers instantiate the hidden state as a trainable model and replace the usual recurrence with a step of self-supervised learning performed on the test sequence. In the two concrete instantiations examined, TTT-Linear uses a linear model and TTT-MLP a two-layer MLP; both keep lowering perplexity as context grows, while a strong Mamba baseline plateaus after 16k tokens. The evaluation covers models from 125M to 1.3B parameters, with direct comparisons against strong Transformer and Mamba baselines.
What carries the argument
The TTT layer, whose hidden state is itself a small model updated by one or more gradient steps of self-supervised learning on the current test sequence.
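The mechanism this hinges on fits in a few lines. The following is a hedged illustration, not the authors' implementation: the paper learns corruption and target projections for its self-supervised task, which are omitted here, and `ttt_linear_step`, the plain reconstruction loss, and the learning rate are illustrative choices.

```python
import numpy as np

def ttt_linear_step(W, x, lr=0.05):
    """One TTT-Linear-style recurrence step (illustrative sketch).

    The hidden state W is itself a linear model. The self-supervised
    task here is plain reconstruction, l(W; x) = 0.5 * ||W x - x||^2;
    the paper's learned corruption/target views are omitted.
    """
    err = W @ x - x           # residual of the reconstruction
    grad = np.outer(err, x)   # dl/dW for the current token
    W = W - lr * grad         # update rule = one gradient step
    z = W @ x                 # output is computed with the updated state
    return W, z

# The state keeps "training" as test tokens arrive, so the
# self-supervised loss on later tokens tends to fall.
rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))
losses = []
for x in rng.standard_normal((64, d)):
    losses.append(0.5 * np.sum((W @ x - x) ** 2))
    W, _ = ttt_linear_step(W, x)
```

On this toy stream the average loss over the last tokens ends up well below the first ones, which is the sense in which the hidden state "learns" from the test sequence.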
If this is right
- Linear-complexity layers can continue to benefit from additional context beyond the point where fixed-state RNNs saturate.
- The same architecture family can be scaled from 125M to over a billion parameters while preserving the long-context scaling behavior.
- Memory and compute trade-offs shift from attention's quadratic growth to the cost of storing and updating the internal model parameters.
- Future layer designs can focus on improving the I/O efficiency of the gradient steps without changing the core recurrence.
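The cost trade-off in the bullets above can be made concrete with a back-of-the-envelope FLOP count. The constants below are assumptions (attention ≈ 2T²d for scores plus value mixing; a d×d linear hidden state ≈ 4Td² for forward pass, gradient, update, and output), so only the resulting T/(2d) ratio is meaningful:

```python
def attention_flops(T, d):
    """Rough self-attention cost: score matrix + value mix ~ 2 * T^2 * d."""
    return 2 * T * T * d

def ttt_linear_flops(T, d):
    """Rough TTT-Linear cost: forward, gradient outer product, update,
    and output are each ~d^2 per token, so ~4 * T * d^2 overall."""
    return 4 * T * d * d

d = 1024  # assumed model width
for T in (2_048, 16_384, 131_072):
    ratio = attention_flops(T, d) / ttt_linear_flops(T, d)
    print(f"T={T:>7}: attention / TTT-Linear ~ {ratio:.0f}x")
```

Under these assumptions the linear layer pulls ahead once T exceeds roughly 2d, which is why the beyond-16k-token regime is where the comparison matters.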
Where Pith is reading between the lines
- Dynamic adaptation of the hidden state could reduce reliance on extremely long fixed context windows if the model learns useful patterns from recent tokens alone.
- The same mechanism might be applied to online settings where new data arrives continuously and the model must improve without a separate training phase.
- If the internal model can be made lighter, TTT layers could serve as drop-in replacements for attention in resource-constrained inference environments.
Load-bearing premise
Gradient-based self-supervised updates performed on the hidden-state model during inference stay stable, cheap enough to run, and do not overfit or degrade the output.
What would settle it
A controlled run in which TTT-Linear or TTT-MLP stops improving perplexity after 16k tokens or begins to produce unstable outputs when the test-time updates are enabled.
read the original abstract
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Test-Time Training (TTT) layers as a framework for sequence modeling with linear complexity but expressive hidden states. The hidden state is instantiated as a learnable model (linear regressor or 2-layer MLP) whose parameters are updated via self-supervised gradient steps on the input sequence at test time. Two variants, TTT-Linear and TTT-MLP, are evaluated at 125M–1.3B parameter scales against a strong Transformer baseline and Mamba; the key empirical result is that TTT models continue to reduce perplexity as context grows beyond 16k tokens while Mamba plateaus.
Significance. If the central empirical claim holds, the work supplies a concrete route to linear-complexity models whose hidden states adapt via test-time learning, yielding continued gains on long contexts where standard RNNs saturate. The scaling experiments to 1.3B parameters and direct head-to-head comparisons with Mamba and Transformer constitute reproducible empirical evidence that strengthens the case for test-time adaptation as a viable direction.
major comments (2)
- [§4 (Experiments)] The claim that TTT-Linear/MLP continue reducing perplexity beyond 16k tokens while Mamba plateaus depends on the hidden-state model receiving stable, beneficial self-supervised gradient updates at inference. The section reports final perplexity numbers but provides no analysis of update stability (gradient norms, per-step loss trajectories, or divergence checks) or of sensitivity to the number of gradient steps and the learning-rate schedule used during test-time training. This is load-bearing for the scaling advantage.
- [§3 (Method)] The update rule for the hidden-state parameters (linear or MLP) is defined as a self-supervised step, yet the manuscript does not specify the exact optimizer, the step count per token or segment, or the regularization used at test time. Without these details it is impossible to assess whether the reported linear-complexity advantage remains tractable and non-overfitting at the 1.3B scale.
minor comments (2)
- [Abstract, §4] The phrase 'memory I/O issues for TTT-MLP' is stated without any quantitative breakdown (e.g., peak memory vs. context length, or wall-clock overhead relative to Mamba). Adding a short table or plot would clarify the practical limitation.
- [§3 (Method)] Notation: The symbols for the hidden-state model parameters and the self-supervised loss are introduced without an explicit table of definitions, making cross-references to the update equations harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of TTT layers for long-context scaling. We address each major comment below and will incorporate the requested details and analyses into the revised manuscript.
read point-by-point responses
-
Referee: [§4 (Experiments)] The claim that TTT-Linear/MLP continue reducing perplexity with >16k tokens while Mamba plateaus depends on the hidden-state model receiving stable, beneficial self-supervised gradient updates at inference. The section reports final perplexity numbers but provides no analysis of update stability (gradient norms, per-step loss trajectories, or divergence checks) or sensitivity to the number of gradient steps and learning-rate schedule used during test-time training. This is load-bearing for the scaling advantage.
Authors: We agree that stability analysis is necessary to support the central empirical claim. In the revised version we will add to §4 new figures and text reporting (i) gradient-norm trajectories during test-time updates on long sequences, (ii) per-step self-supervised loss curves on held-out segments, (iii) explicit checks for divergence or instability, and (iv) ablation tables showing sensitivity of final perplexity to the number of gradient steps and the learning-rate schedule used at test time. These additions will directly substantiate that the observed scaling advantage arises from stable, beneficial updates. revision: yes
-
Referee: [§3 (Method)] The update rule for the hidden-state parameters (linear or MLP) is defined as a self-supervised step, yet the manuscript does not specify the exact optimizer, step count per token/segment, or regularization used at test time. Without these details it is impossible to assess whether the reported linear-complexity advantage remains tractable and non-overfitting at 1.3B scale.
Authors: We acknowledge the omission of precise test-time hyperparameters. The revised §3 will explicitly state the optimizer (Adam with β1=0.9, β2=0.999), the exact number of gradient steps performed per token or per segment, the learning-rate value and any decay schedule, and the regularization applied (weight decay of 0.01 together with gradient clipping at norm 1.0). These details will be provided for both TTT-Linear and TTT-MLP so that readers can verify tractability and reproducibility at the 1.3B scale. revision: yes
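The configuration the (simulated) rebuttal names — Adam with β1 = 0.9, β2 = 0.999, weight decay 0.01, and gradient clipping at norm 1.0 — could be sketched as a test-time update like the one below. This is a hedged reconstruction, not code from the paper; the reconstruction loss, shapes, and the choice to fold weight decay into the gradient (plain L2, not AdamW-style decoupling) are all illustrative:

```python
import numpy as np

def adam_ttt_step(W, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
                  eps=1e-8, wd=0.01, clip=1.0):
    """One clipped Adam step on the hidden-state parameters at test time.

    Hyperparameters mirror those named in the rebuttal; everything else
    (loss, decay style) is an assumption for illustration.
    """
    gnorm = np.linalg.norm(grad)
    if gnorm > clip:                       # gradient clipping at norm 1.0
        grad = grad * (clip / gnorm)
    grad = grad + wd * W                   # weight decay 0.01 (L2 into grad)
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    mhat = m / (1 - b1 ** t)               # bias correction
    vhat = v / (1 - b2 ** t)
    W = W - lr * mhat / (np.sqrt(vhat) + eps)
    return W, m, v

# Toy test-time loop: adapt a linear hidden state to reconstruct tokens.
rng = np.random.default_rng(1)
d = 8
W = np.zeros((d, d))
m = np.zeros_like(W)
v = np.zeros_like(W)
for t, x in enumerate(rng.standard_normal((50, d)), start=1):
    grad = np.outer(W @ x - x, x)   # dl/dW for l = 0.5 * ||W x - x||^2
    W, m, v = adam_ttt_step(W, m, v, grad, t)
```

The clipping plus Adam's per-coordinate normalization bounds how far any single token can move the state, which is exactly the stability property the referee asks the authors to demonstrate.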
Circularity Check
No significant circularity; architectural proposal with direct empirical validation
full rationale
The paper defines TTT layers by making the hidden state itself a learnable model (linear or two-layer MLP) whose parameters are updated via a self-supervised gradient step on each test token or segment. This is an explicit architectural choice, not a mathematical derivation that reduces to prior equations or fitted inputs. No load-bearing self-citations, uniqueness theorems from the same authors, or ansatzes smuggled in via prior work appear in the core construction. The central scaling claim (TTT continues reducing perplexity beyond 16k tokens while Mamba plateaus) rests on direct experimental comparisons at the 125M–1.3B scale rather than on any reduction of outputs to inputs by construction. The claims are therefore grounded in external benchmarks rather than circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: self-supervised gradient updates on a small model serving as the hidden state improve expressiveness without instability at test time.
invented entities (1)
- TTT layer (no independent evidence)
Forward citations
Cited by 21 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
A Single-Layer Model Can Do Language Modeling
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
-
Linearizing Vision Transformer with Test-Time Training
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Cortico-cerebellar modularity as an architectural inductive bias for efficient temporal learning
CB-RNNs with a cerebellar feedforward module learn temporal tasks faster than matched RNNs, with the module driving efficiency even after freezing the recurrent core as a fixed reservoir.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
Measuring Accuracy and Energy-to-Solution of Quantum Fine-Tuning of Foundational AI Models
Trapped-ion quantum fine-tuning of AI models shows linear energy scaling and 24% better classification error than classical logistic regression or SVM baselines, with a projected energy break-even at 34 qubits.
-
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Learning to learn by gradient descent by gradient descent
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016
work page 2016
-
[3]
You just found out your book was used to train ai
Authors Guild. You just found out your book was used to train ai. now what?, 2023. Accessed: 2024-06-24
work page 2023
-
[4]
xlstm: Ex- tended long short-term memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Ex- tended long short-term memory. arXiv preprint arXiv:2405.04517, 2024
-
[5]
Learning a synaptic learning rule
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Citeseer, 1990
work page 1990
-
[6]
The nadaraya-watson kernel regression function estimator
Hermanus Josephus Bierens. The nadaraya-watson kernel regression function estimator. (Serie Research Memoranda; No. 1988-58). Faculty of Economics and Business Administration, Vrije Universiteit Amsterdam., 1988
work page 1988
-
[7]
Pattern recognition and machine learning , volume 4
Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning , volume 4. Springer, 2006
work page 2006
-
[8]
Gpt-neox-20b: An open-source autoregressive language model
Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022
-
[9]
Local learning algorithms.Neural computation, 4(6):888–900, 1992
Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992
work page 1992
-
[10]
Variable kernel estimates of multivariate densities
Leo Breiman, William Meisel, and Edward Purcell. Variable kernel estimates of multivariate densities. Technometrics, 19(2):135–144, 1977
work page 1977
-
[11]
Weighted nadaraya–watson regression estimation
Zongwu Cai. Weighted nadaraya–watson regression estimation. Statistics & probability letters, 51(3):307–318, 2001
work page 2001
-
[12]
Training deep nets with sublinear memory cost, 2016
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016
work page 2016
-
[13]
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[14]
A tutorial on kernel density estimation and recent advances
Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017
work page 2017
-
[15]
Meta-learning fast weight language models
Kevin Clark, Kelvin Guu, Ming-Wei Chang, Panupong Pasupat, Geoffrey Hinton, and Moham- mad Norouzi. Meta-learning fast weight language models. arXiv preprint arXiv:2212.02475, 2022
-
[16]
Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou, and Thorsten Joachims. Large scale transductive svms. Journal of Machine Learning Research, 7(8), 2006
work page 2006
-
[17]
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024. 20
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for e fficient language models. arXiv preprint arXiv:2402.19427, 2024
work page internal anchor Pith review arXiv 2024
-
[19]
In the long (context) run, 2023
Harm de Vries. In the long (context) run, 2023. Accessed: 2024-06-24
work page 2023
-
[20]
Dynamic connections in neural networks.Biological cybernetics, 46(1):27–39, 1982
Jerome A Feldman. Dynamic connections in neural networks.Biological cybernetics, 46(1):27–39, 1982
work page 1982
-
[21]
Model-agnostic meta-learning for fast adapta- tion of deep networks
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta- tion of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017
work page 2017
-
[22]
A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998
work page 1998
-
[23]
Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 2022
work page 2022
-
[24]
The pile: An 800gb dataset of diverse text for language modeling, 2020
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020
work page 2020
-
[25]
EasyLM: A Simple And Scalable Training Framework for Large Language Models
Xinyang Geng. EasyLM: A Simple And Scalable Training Framework for Large Language Models. https://github.com/young-geng/EasyLM, mar 2023. https://github.com/ young-geng/EasyLM
work page 2023
-
[26]
Unlocking state-tracking in linear rnns through negative eigenvalues
Riccardo Grazzi, Julien Siems, Arber Zela, Jörg KH Franke, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear rnns through negative eigenvalues. International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[27]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Self-supervised policy adaptation during deployment
Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309, 2020
-
[29]
Test-time training on nearest neighbors for large language models
Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. arXiv preprint arXiv:2305.18466, 2023
-
[30]
Horace He. Strangely, matrix multiplications on gpus run faster when given "predictable" data! [short], 2024. Accessed: 2024-06-30
work page 2024
-
[31]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[32]
Using fast weights to deblur old memories
Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. InProceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186, 1987
work page 1987
-
[33]
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997
work page 1997
-
[34]
Rae, Oriol Vinyals, and Laurent Sifre
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...
work page 2022
-
[35]
Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention. In International Conference on Machine Learning, pages 9639–9659. PMLR, 2022
work page 2022
-
[36]
Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. Practical computational power of linear transformers and their recurrent and self-referential extensions. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
work page 2023
-
[37]
Neural di fferential equations for learning to program neural nets through continuous learning rules
Kazuki Irie, Francesco Faccio, and Jürgen Schmidhuber. Neural di fferential equations for learning to program neural nets through continuous learning rules. Advances in Neural Information Processing Systems, 35:38614–38628, 2022
work page 2022
-
[38]
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. Going beyond linear transformers with recurrent fast weight programmers.Advances in Neural Information Processing Systems, 34:7703–7717, 2021
work page 2021
-
[39]
A modern self-referential weight matrix that learns to modify itself
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. A modern self-referential weight matrix that learns to modify itself. In International Conference on Machine Learning , pages 9660–9677. PMLR, 2022
work page 2022
-
[40]
Images as weight matrices: Sequential image generation through synaptic learning rules
Kazuki Irie and Jürgen Schmidhuber. Images as weight matrices: Sequential image generation through synaptic learning rules. International Conference on Learning Representations (ICLR), 2022
work page 2022
-
[41]
Online domain adaptation of a pre-trained cascade of classifiers
Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011
work page 2011
-
[42]
Learning to classify text using support vector machines, volume 668
Thorsten Joachims. Learning to classify text using support vector machines, volume 668. Springer Science & Business Media, 2002
work page 2002
-
[43]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[44]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020
work page 2020
-
[45]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[46]
Meta learning backpropagation and improving it
Louis Kirsch and Jürgen Schmidhuber. Meta learning backpropagation and improving it. Advances in Neural Information Processing Systems, 34:14122–14134, 2021
work page 2021
-
[47]
Dynamic evaluation of neural sequence models
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pages 2766–2775. PMLR, 2018
work page 2018
-
[48]
Dynamic Evaluation of Transformer Language Models
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[49]
E fficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. E fficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023
work page 2023
-
[50]
Building machines that learn and think like people
Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40:e253, 2017. 22
work page 2017
-
[51]
Building high-level features using large scale unsupervised learning
Quoc V Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8595–8598. IEEE, 2013
work page 2013
-
[52]
World model on million-length video and language with blockwise ringattention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268, 2024
-
[53]
Consistent video depth estimation
Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4):71–1, 2020
work page 2020
-
[54]
Gradient-based hyperparameter optimization through reversible learning
Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113–2122. PMLR, 2015
work page 2015
-
[55]
Meta-Learning Update Rules for Unsupervised Representation Learning
Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[56]
Online model distillation for efficient video inference
Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference. arXiv preprint arXiv:1812.02699, 2018
-
[57]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P . Pret- tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011
work page 2011
-
[58]
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence
Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024
-
[60]
The devil in linear transformer
Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022
-
[61]
The perceptron: a probabilistic model for information storage and organiza- tion in the brain
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organiza- tion in the brain. Psychological review, 65(6):386, 1958
work page 1958
-
[62]
Linear transformers are secretly fast weight programmers
Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021
work page 2021
-
[63]
Learning associative inference using fast weight memory
Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning associative inference using fast weight memory. arXiv preprint arXiv:2011.07831, 2020
-
[64]
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-
Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987
-
[65]
Learning to control fast-weight memories: An alternative to dynamic recurrent networks
Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992
-
[66]
GLU variants improve transformer
Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
-
[67]
Normformer: Improved transformer pretraining with extra normalization
Sam Shleifer, Jason Weston, and Myle Ott. Normformer: Improved transformer pretraining with extra normalization. arXiv preprint arXiv:2110.09456, 2021
-
[68]
-
[69]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2023
-
[70]
Learning to (learn at test time)
Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time). arXiv preprint arXiv:2310.13807, 2023
-
[71]
Online learning of unknown dynamics for model-based controllers in legged locomotion
Yu Sun, Wyatt L Ubellacker, Wen-Loong Ma, Xiang Zhang, Changhao Wang, Noel V Csomay-Shanklin, Masayoshi Tomizuka, Koushil Sreenath, and Aaron D Ames. Online learning of unknown dynamics for model-based controllers in legged locomotion. IEEE Robotics and Automation Letters, 6(4):8442–8449, 2021
-
[72]
Test-time training with self-supervision for generalization under distribution shifts
Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020
-
[73]
Learning to learn: Introduction and overview
Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998
-
[74]
Using fast weights to improve persistent contrastive divergence
Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th annual international conference on machine learning, pages 1033–1040, 2009
-
[75]
Llama 2: Open foundation and fine-tuned chat models, 2023
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
-
[76]
The nature of statistical learning theory
Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013
-
[77]
Extracting and composing robust features with denoising autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008
-
[78]
The correlation theory of brain function
Christoph Von Der Malsburg. The correlation theory of brain function. In Models of neural networks: Temporal aspects of coding and information processing in biological systems, pages 95–119. Springer, 1994
-
[79]
Test-time training on video streams
Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A Efros, and Xiaolong Wang. Test-time training on video streams. arXiv preprint arXiv:2307.05014, 2023
-
[80]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019