MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Alexander Meulemans; Blaise Agüera y Arcas; Charlotte Frenkel; Guillaume Lajoie; João Sacramento; Johannes von Oswald; Kaitlin Maile; Luca Versari; Maximilian Schlegel; Nino Scherrer

arxiv: 2506.05233 · v2 · pith:OKUXID3Inew · submitted 2025-06-05 · 💻 cs.LG · cs.AI · cs.CL

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Sarthak Mittal, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento

classification 💻 cs.LG cs.AI cs.CL

keywords compute, layer, modeling, performance, test-time, time, during, here

Abstract

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that scale linearly with sequence length during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba, or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective that is approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), whose original form could only run sequentially in time and was therefore not scalable. This layer again stems from an in-context loss, but one that is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments at up to the billion-parameter scale, we show that optimal test-time training enables lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long-context understanding. This performance gain comes at the cost of additional FLOPs spent at inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance, here by spending compute to solve sequential optimization problems within the neural network itself.
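
To make the idea in the abstract concrete, here is a minimal sketch of what "minimizing an in-context regression loss to optimality at every time point" could look like. It rests on several assumptions not stated in the abstract: the loss is taken to be ridge-regularized least squares over the key/value pairs seen so far, there is no gating or forgetting, and the names `conjugate_gradient`, `mesa_layer_sequential`, `ridge`, `S_kk`, and `S_vk` are hypothetical. It is the naive sequential form, not the chunkwise parallel version the paper introduces.

```python
def conjugate_gradient(A, b, iters=50, tol=1e-12):
    """Solve A x = b for a symmetric positive-definite A (list of lists)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # initial search direction
    rs = sum(ri * ri for ri in r)    # squared residual norm
    for _ in range(iters):
        if rs < tol:                 # converged (also covers b = 0)
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


def mesa_layer_sequential(keys, values, queries, ridge=1.0):
    """Assumed objective: W_t = argmin_W sum_i ||v_i - W k_i||^2 + ridge * ||W||^2.

    The closed-form minimizer is W_t = S_vk (S_kk + ridge * I)^{-1}, so the
    output W_t q_t is obtained by solving one linear system per time step.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # S_kk starts at ridge * I and accumulates k k^T; S_vk accumulates v k^T.
    S_kk = [[ridge if i == j else 0.0 for j in range(d_k)] for i in range(d_k)]
    S_vk = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_k):
            for j in range(d_k):
                S_kk[i][j] += k[i] * k[j]
        for i in range(d_v):
            for j in range(d_k):
                S_vk[i][j] += v[i] * k[j]
        x = conjugate_gradient(S_kk, q)   # x = (S_kk)^{-1} q
        outputs.append([sum(S_vk[i][j] * x[j] for j in range(d_k))
                        for i in range(d_v)])
    return outputs


# Usage: one key/value pair, then a query along the first coordinate.
out = mesa_layer_sequential(keys=[[1.0, 1.0]], values=[[1.0]],
                            queries=[[1.0, 0.0]], ridge=1.0)
```

The state in this sketch has a fixed size (`d_k * d_k + d_v * d_k` numbers) regardless of sequence length, which corresponds to the constant-memory property the abstract attributes to this family of recurrent models. The extra cost relative to a plain linear-attention read-out is the per-step linear solve, in line with the abstract's remark that the gain comes at the cost of additional inference-time compute.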

This paper has not been read by Pith yet.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
cs.LG 2026-04 unverdicted novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
cs.LG 2026-05 unverdicted novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
cs.LG 2026-05 unverdicted novelty 6.0

Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
cs.LG 2025-11 unverdicted novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
Higher-order Linear Attention
cs.LG 2025-10 unverdicted novelty 6.0

Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
cs.CL 2025-09 unverdicted novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
cs.CL 2025-06 unverdicted novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.