DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products. arXiv preprint arXiv:2502.10297

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · dataset 1

citation-polarity summary

fields

cs.LG 7 · cs.CL 2

years

2026: 6 · 2025: 3

verdicts

UNVERDICTED 9

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 4 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Selective Rotary Position Embedding

cs.CL · 2025-11-21 · unverdicted · novelty 7.0

Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

Kaczmarz Linear Attention

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.

citing papers explorer

Showing 9 of 9 citing papers.

  • WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 39 · 4 links

    WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

  • Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 47

    Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

  • Selective Rotary Position Embedding cs.CL · 2025-11-21 · unverdicted · none · ref 57

    Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.

  • Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders cs.LG · 2026-05-15 · unverdicted · none · ref 15

    Hybrid Gated DeltaNet-Attention decoders solve parity-conditioned retrieval with O(1) scratchpad while pure Gated DeltaNet cannot and pure Gated Attention needs polynomial length.

  • OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention cs.LG · 2026-05-13 · unverdicted · none · ref 51

    OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

  • M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling cs.LG · 2026-03-15 · unverdicted · none · ref 35

    M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

  • Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression cs.LG · 2025-11-26 · unverdicted · none · ref 53

    Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

  • MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 36

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

  • Kaczmarz Linear Attention cs.LG · 2026-05-09 · unverdicted · none · ref 33

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
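
The "key-norm-normalized step size" in the Kaczmarz Linear Attention summary can be illustrated with a minimal sketch. It assumes the plain, ungated delta-rule write S ← S + β (v − S k) kᵀ and sets β = 1/‖k‖², which is the classical Kaczmarz projection step. The function names, the `eps` guard, and the omission of gating are assumptions made for illustration; the citing paper's actual formulation may differ.

```python
def read(S, k):
    """Read from a matrix state: returns S @ k for a list-of-rows matrix S."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_step(S, k, v, beta):
    """One delta-rule write: S <- S + beta * (v - S k) k^T."""
    # Residual between the target value and what the state currently returns.
    err = [v_i - p_i for v_i, p_i in zip(v, read(S, k))]
    # Rank-1 correction along the key direction, scaled by the step size.
    return [[s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
            for row, e_i in zip(S, err)]


def kaczmarz_step(S, k, v, eps=1e-12):
    """Delta-rule write with step size 1 / ||k||^2 (hypothetical helper).

    eps is an illustrative guard against a zero key, not taken from any paper.
    """
    norm_sq = sum(k_j * k_j for k_j in k)
    return delta_step(S, k, v, 1.0 / (norm_sq + eps))


# Usage: a 2x3 state, a non-unit key, and a target value.
S0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
key = [1.0, 2.0, 2.0]
value = [3.0, -1.0]
S1 = kaczmarz_step(S0, key, value)
recalled = read(S1, key)
```

With this step size, reading the updated state at the same key recovers the written value (up to `eps`) regardless of the key's norm, whereas a fixed β gives exact recall only when β‖k‖² = 1.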