DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
12 Pith papers cite this work.
citing papers explorer
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
  Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Dynamic Short Convolutions Improve Transformers
  Dynamic short convolutions applied to key/query/value projections and linear layers in Transformers yield consistent performance gains and 1.33-1.60x compute advantages over standard models on language modeling from 150M to 2B parameters.
- Memory by Design: Probabilistic Sequence Layers
  The design-model framework unifies sub-quadratic sequence models as Bayesian filters and introduces a covariance-tracking Bayesian Layer that improves retrieval robustness beyond training regimes on MQAR and RULER benchmarks.
- Universal Time Series Generation with Neural Controlled Differential Equations
  Proves SLiCEs are universal time-series generators approximating path laws in W_∞ and proposes G-SLiCEs for path-space flow matching with benefits on irregular grids.
- Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
  Hybrid Gated DeltaNet-Attention decoders solve parity-conditioned retrieval with O(1) scratchpad while pure Gated DeltaNet cannot and pure Gated Attention needs polynomial length.
- OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
  OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
- M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
  M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
- Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
  Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
- Kaczmarz Linear Attention
  Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.