Rnns are not transformers (yet): The key bottleneck on in-context retrieval

Kaiyue Wen, Xingyu Dang, Kaifeng Lyu · 2024 · arXiv 2402.18510

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

citing papers explorer

Showing 3 of 3 citing papers.

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention cs.LG · 2026-05-13 · unverdicted · none · ref 62
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 107
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 83
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

Rnns are not transformers (yet): The key bottleneck on in-context retrieval

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer