pith. machine review for the scientific record. sign in

arxiv: 2510.26692 · v2 · submitted 2025-10-30 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Kimi Linear: An Expressive, Efficient Attention Architecture

Authors on Pith no claims yet

Pith reviewed 2026-05-13 23:42 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords linear attentionKimi Delta Attentionhybrid attentionKV cachelong contextdecoding throughputdelta ruleefficient transformers
0
0 comments X

The pith

Kimi Linear, a hybrid linear attention model, outperforms full attention across contexts while cutting KV cache by up to 75%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Kimi Linear as a hybrid architecture combining linear and full attention layers. It claims this design beats standard full attention models even when trained identically on the same data. The improvement comes from Kimi Delta Attention, which adds finer-grained gating to make better use of the limited memory in linear attention. Experiments with a 3B-active-parameter model show gains on short and long tasks plus reinforcement learning scaling, plus major efficiency wins in memory and speed. If correct, this suggests linear attention can replace full attention as a practical, higher-performing option for large models.

Core claim

Kimi Linear is a hybrid linear attention architecture that for the first time outperforms full attention under fair comparisons across short-context, long-context, and reinforcement learning scaling regimes. Its core is Kimi Delta Attention, an expressive linear module that extends Gated DeltaNet with finer-grained gating to use finite-state RNN memory more effectively. A specialized chunkwise algorithm based on a Diagonal-Plus-Low-Rank transition matrix variant keeps computation low. A model with 3B activated parameters and 48B total parameters, built from a layerwise mix of KDA and Multi-Head Latent Attention, exceeds full MLA performance while reducing KV cache usage by up to 75% and up 6

What carries the argument

Kimi Delta Attention (KDA) module with finer-grained gating on top of Gated DeltaNet, combined with a specialized Diagonal-Plus-Low-Rank (DPLR) transition matrix for efficient chunkwise computation.

Load-bearing premise

The performance gains come from the finer-grained gating in KDA and the specialized DPLR variant rather than from differences in training data, hyperparameters, or evaluation setup.

What would settle it

Train identical Kimi Linear and full attention models on the exact same data and hyperparameters, then measure whether the linear version still scores higher on the reported benchmarks.

read the original abstract

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Kimi Linear, a hybrid linear attention architecture whose core is Kimi Delta Attention (KDA), an extension of Gated DeltaNet that adds finer-grained gating to improve utilization of finite-state RNN memory. It employs a bespoke chunkwise algorithm based on a specialized Diagonal-Plus-Low-Rank (DPLR) transition matrix that reduces computation relative to the general DPLR form while staying closer to the classical delta rule. A 3B-activated / 48B-total-parameter model is pretrained as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). The central empirical claim is that, under an identical training recipe, Kimi Linear outperforms full MLA across short-context, long-context, and RL scaling regimes, while cutting KV cache usage by up to 75% and delivering up to 6x decoding throughput at 1M context. The KDA kernel, vLLM integration, and model checkpoints are released.

Significance. If the performance margins survive rigorous verification of identical training conditions, the result would be significant: it would demonstrate that a carefully designed linear attention module can surpass full attention in both accuracy and efficiency across multiple regimes, offering a practical drop-in replacement that materially reduces inference cost for long contexts. The open-sourcing of the kernel and models further strengthens the contribution by enabling direct reproduction and extension.

major comments (2)
  1. Abstract and Experiments section: the claim that gains arise under an 'identical training recipe' is load-bearing for attributing improvements to KDA's gating and the specialized DPLR variant. The manuscript does not supply side-by-side hyperparameter tables, exact data-mix details, or ablations that swap only the attention module while freezing all other factors; without these, the reported margins (and 75% KV-cache reduction) could stem from unstated differences in optimization or evaluation rather than the architecture.
  2. §3.2 (KDA and DPLR formulation): the specialized DPLR variant is asserted to be both more efficient and more consistent with the classical delta rule than the general DPLR, yet no direct complexity comparison, flop counts, or numerical-stability analysis versus the general formulation is provided to support this design choice.
minor comments (2)
  1. Figure captions and axis labels in the throughput and KV-cache plots should explicitly state the context lengths and batch sizes used so readers can directly compare the 1M-context 6x claim.
  2. Notation for the gating variables in the KDA equations should be unified across the text and pseudocode to avoid ambiguity between per-head and per-dimension gates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and analyses.

read point-by-point responses
  1. Referee: Abstract and Experiments section: the claim that gains arise under an 'identical training recipe' is load-bearing for attributing improvements to KDA's gating and the specialized DPLR variant. The manuscript does not supply side-by-side hyperparameter tables, exact data-mix details, or ablations that swap only the attention module while freezing all other factors; without these, the reported margins (and 75% KV-cache reduction) could stem from unstated differences in optimization or evaluation rather than the architecture.

    Authors: We appreciate this point and agree that explicit documentation is necessary to support the attribution of gains to the architecture. The training runs for Kimi Linear and the full MLA baseline used identical data mixtures, optimizer settings, learning-rate schedules, batch sizes, and all other hyperparameters. In the revised manuscript we will add: (1) side-by-side hyperparameter tables, (2) precise data-mix specifications, and (3) an ablation study that replaces only the attention module while freezing every other training factor. These additions will make the identical-recipe claim fully verifiable and will confirm that the observed margins arise from KDA rather than extraneous differences. revision: yes

  2. Referee: §3.2 (KDA and DPLR formulation): the specialized DPLR variant is asserted to be both more efficient and more consistent with the classical delta rule than the general DPLR, yet no direct complexity comparison, flop counts, or numerical-stability analysis versus the general formulation is provided to support this design choice.

    Authors: We agree that quantitative support for the design choice would strengthen the paper. In the revision we will expand §3.2 (or add an appendix) with: (i) asymptotic and practical flop-count comparisons, (ii) explicit complexity analysis of the specialized versus general DPLR transition matrices, and (iii) numerical-stability experiments (including forward-pass error accumulation and gradient-norm statistics) on both formulations. These results will demonstrate the computational savings and closer fidelity to the classical delta rule that motivated the specialized variant. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on new architecture and reported benchmarks

full rationale

The paper presents an empirical architecture paper whose central claim is that a hybrid of KDA (finer-grained gating extension of Gated DeltaNet) and MLA outperforms full attention under an identical training recipe, with measured KV-cache and throughput gains. No load-bearing mathematical derivation, prediction, or uniqueness theorem is offered that reduces by construction to fitted inputs or prior self-citations. The architecture description introduces new components (chunkwise DPLR variant, gating mechanism) whose performance is asserted via experimental results rather than self-referential definitions or renamed known patterns. Self-citations, if present for Gated DeltaNet, are not load-bearing for the outperformance claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the architectural description of KDA and hybrid layers.

pith-pipeline@v0.9.0 · 5802 in / 1107 out tokens · 55906 ms · 2026-05-13T23:42:51.206799+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

    cs.LG 2026-04 unverdicted novelty 7.0

    Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

  2. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  3. Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

    cs.CV 2026-04 unverdicted novelty 7.0

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  4. OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

    cs.LG 2026-05 unverdicted novelty 6.0

    OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

  5. $\delta$-mem: Efficient Online Memory for Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

  6. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  7. Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

    cs.LG 2026-05 unverdicted novelty 6.0

    CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.

  8. UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

    cs.CL 2026-05 unverdicted novelty 6.0

    UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.

  9. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  10. Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

    cs.DC 2026-04 unverdicted novelty 6.0

    PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...

  11. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  12. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  13. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  14. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  15. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  16. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  17. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

  18. Heterogeneous Scientific Foundation Model Collaboration

    cs.AI 2026-04 unverdicted novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

  19. SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...

  20. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

  21. LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 21 Pith papers · 27 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal et al. “gpt-oss-120b & gpt-oss-20b model card”. In:arXiv preprint arXiv:2508.10925(2025)

  2. [2]

    Colt5: Faster long-range transformers with conditional computation

    Joshua Ainslie et al. “Colt5: Faster long-range transformers with conditional computation”. In:arXiv preprint arXiv:2303.09752(2023)

  3. [3]

    Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

    Zeyuan Allen-Zhu. “Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers”. In:SSRN Electronic Journal(May 2025). Available at SSRN: https://ssrn.com/abstract=5240330 or http://dx.doi.org/10.2139/ssrn.5240330.DOI:10.2139/ssrn.5240330

  4. [4]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora et al. “Simple linear attention language models balance the recall-throughput tradeoff”. In: Forty-first International Conference on Machine Learning. 2024.URL: https://openreview.net/forum? id=e93ffDcpH3

  5. [5]

    Simran Arora et al.Zoology: Measuring and Improving Recall in Efficient Language Models. 2023. arXiv: 2312.04927 [cs.CL]

  6. [6]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025

    Yushi Bai et al. “Longbench v2: Towards deeper understanding and reasoning on realistic long-context multi- tasks”. In:arXiv preprint arXiv:2412.15204(2024)

  7. [7]

    Round and Round We Go! What makes Rotary Positional Encodings useful?

    Federico Barbero et al. “Round and Round We Go! What makes Rotary Positional Encodings useful?” In: Proceedings of ICLR. 2025.URL:https://openreview.net/forum?id=GtvuNrk58a

  8. [8]

    Atlas: Learning to optimally memorize the context at test time, 2025

    Ali Behrouz et al. “Atlas: Learning to optimally memorize the context at test time”. In:arXiv preprint arXiv:2505.23735(2025)

  9. [9]

    Unlimiformer: Long-range transformers with unlimited length input

    Amanda Bertsch et al. “Unlimiformer: Long-range transformers with unlimited length input”. In:Advances in NeurIPS36 (2023), pp. 35522–35543

  10. [10]

    Biderman, H

    Stella Biderman et al. “Lessons from the trenches on reproducible evaluation of language models”. In:arXiv preprint arXiv:2405.14782(2024)

  11. [11]

    The WY Representation for Products of Householder Matrices

    Christian Bischof and Charles Van Loan. “The WY Representation for Products of Householder Matrices”. In: SIAM Journal on Scientific and Statistical Computing(1987), s2–s13.URL: https://doi.org/10.1137/ 0908009

  12. [12]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models

    Aaron Blakeman et al. “Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models”. In: arXiv preprint arXiv:2504.03624(2025)

  13. [13]

    Long code arena: a set of benchmarks for long-context code models

    Egor Bogomolov et al. “Long code arena: a set of benchmarks for long-context code models”. In:arXiv preprint arXiv:2406.11612(2024)

  14. [14]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark et al. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In: arXiv:1803.05457v1(2018)

  15. [15]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui et al. “The entropy mechanism of reinforcement learning for reasoning language models”. In:arXiv preprint arXiv:2505.22617(2025)

  16. [16]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality”. In:CoRRabs/2405.21060 (2024).DOI: 10.48550/ARXIV.2405.21060 . arXiv:2405.21060.URL:https://doi.org/10.48550/arXiv.2405.21060

  17. [17]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. In:Advances in NeurIPS. 2022, pp. 16344–16359.URL: https://proceedings.neurips.cc/paper_files/paper/ 2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf

  18. [18]

    DeepSeek-AI.DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention. 2025

  19. [19]

    DeepSeek-AI et al.DeepSeek-V3 Technical Report. 2025. arXiv: 2412.19437 [cs.CL] .URL: https:// arxiv.org/abs/2412.19437

  20. [20]

    Jiayu Ding et al.LongNet: Scaling Transformers to 1,000,000,000 Tokens. 2023. arXiv: 2307.02486 [cs.CL]. URL:https://arxiv.org/abs/2307.02486

  21. [21]

    Juechu Dong et al.Flex Attention: A Programming Model for Generating Optimized Attention Kernels. 2024. arXiv:2412.05496 [cs.LG].URL:https://arxiv.org/abs/2412.05496

  22. [22]

    Xin Dong et al.Hymba: A Hybrid-head Architecture for Small Language Models. 2024. arXiv: 2411.13676 [cs.CL].URL:https://arxiv.org/abs/2411.13676

  23. [23]

    Mom: Linear sequence modeling with mixture-of-memories

    Jusen Du et al. “Mom: Linear sequence modeling with mixture-of-memories”. In:arXiv preprint arXiv:2502.13685(2025)

  24. [24]

    Native Hybrid Attention for Efficient Sequence Modeling

    Jusen Du et al. “Native Hybrid Attention for Efficient Sequence Modeling”. In:arXiv preprint arXiv:2510.07019 (2025)

  25. [25]

    Moa: Mixture of sparse attention for automatic large language model compression

    Tianyu Fu et al. “Moa: Mixture of sparse attention for automatic large language model compression”. In:arXiv preprint arXiv:2406.14909(2024). 19 Kimi Linear: An Expressive, Efficient Attention ArchitectureTECHNICALREPORT

  26. [26]

    Are we done with mmlu? CoRR, abs/2406.04127,

    Aryo Pradipta Gema et al. “Are we done with mmlu?” In:arXiv preprint arXiv:2406.04127(2024)

  27. [27]

    Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

    Riccardo Grazzi et al. “Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues”. In:Proceed- ings of ICLR. 2025.URL:https://openreview.net/forum?id=UvTo3tVBk2

  28. [28]

    How ordinary elimination became Gaussian elimination

    Joseph F. Grcar. “How ordinary elimination became Gaussian elimination”. In:Historia Mathematica38.2 (May 2011), pp. 163–218.ISSN: 0315-0860.DOI: 10.1016/j.hm.2010.06.003 .URL: http://dx.doi. org/10.1016/j.hm.2010.06.003

  29. [29]

    Albert Gu and Tri Dao.Mamba: Linear-Time Sequence Modeling with Selective State Spaces. 2023. arXiv: 2312.00752 [cs.LG]

  30. [30]

    Albert Gu, Karan Goel, and Christopher Ré.Efficiently Modeling Long Sequences with Structured State Spaces

  31. [31]

    arXiv:2111.00396 [cs.LG]

  32. [32]

    Xiangming Gu et al.When Attention Sink Emerges in Language Models: An Empirical View. 2025. arXiv: 2410.10781 [cs.CL].URL:https://arxiv.org/abs/2410.10781

  33. [33]

    Yuxian Gu et al.Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search. 2025. arXiv: 2508.15884 [cs.CL].URL:https://arxiv.org/abs/2508.15884

  34. [34]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Daya Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning”. In:Nature 645.8081 (2025), pp. 633–638

  35. [35]

    Log-linear attention

    Han Guo et al. “Log-linear attention”. In:arXiv preprint arXiv:2506.04761(2025)

  36. [36]

    Star-transformer

    Qipeng Guo et al. “Star-transformer”. In:arXiv preprint arXiv:1902.09113(2019)

  37. [37]

    Dan Hendrycks et al.Measuring Massive Multitask Language Understanding. 2021. arXiv: 2009.03300 [cs.CY].URL:https://arxiv.org/abs/2009.03300

  38. [38]

    Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv: 2203 . 15556 [cs.CL].URL:https://arxiv.org/abs/2203.15556

  39. [39]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh et al. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: arXiv preprint arXiv:2404.06654(2024)

  40. [40]

    Attractor memory for long-term time series forecasting: A chaos perspective

    Jiaxi Hu et al. “Attractor memory for long-term time series forecasting: A chaos perspective”. In:Advances in NeurIPS37 (2024), pp. 20786–20818

  41. [41]

    Comba: Improving Bilinear

    Jiaxi Hu et al. “Comba: Improving Nonlinear RNNs with Closed-loop Control”. In:arXiv preprint arXiv:2506.02475(2025)

  42. [42]

    Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length General- ization

    Ermo Hua et al. “Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length General- ization”. In:arXiv preprint arXiv:2412.17739(2024)

  43. [43]

    Transformer Quality in Linear Time

    Weizhe Hua et al. “Transformer Quality in Linear Time”. In:Proceedings of ICML. Ed. by Kamalika Chaudhuri et al. PMLR, 2022, pp. 9099–9117.URL:https://proceedings.mlr.press/v162/hua22a.html

  44. [44]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang et al. “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models”. In: Advances in NeurIPS36 (2023), pp. 62991–63010

  45. [45]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain et al. “Livecodebench: Holistic and contamination free evaluation of large language models for code”. In:arXiv preprint arXiv:2403.07974(2024)

  46. [46]

    Samy Jelassi et al.Repeat After Me: Transformers are Better than State Space Models at Copying. 2024. arXiv: 2402.01032 [cs.LG]

  47. [47]

    Accumulating Householder transformations, revisited

    Thierry Joffrain et al. “Accumulating Householder transformations, revisited”. In: (2006), pp. 169–179.URL: https://doi.org/10.1145/1141885.1141886

  48. [48]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi et al. “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension”. In:arXiv preprint arXiv:1705.03551(2017)

  49. [49]

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Atten- tion

    Angelos Katharopoulos et al. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Atten- tion”. In:Proceedings of ICML. Ed. by Hal Daumé III and Aarti Singh. PMLR, 2020, pp. 5156–5165.URL: https://proceedings.mlr.press/v119/katharopoulos20a.html

  50. [50]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad et al. “The impact of positional encoding on length generalization in transformers”. In:Advances in NeurIPS36 (2023), pp. 24892–24928

  51. [51]

    Kimi K2: Open Agentic Intelligence

    Team Kimi et al. “Kimi k2: Open agentic intelligence”. In:arXiv preprint arXiv:2507.20534(2025)

  52. [52]

    Reformer: The Efficient Transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The efficient transformer”. In:arXiv preprint arXiv:2001.04451(2020)

  53. [53]

    Krishna, K

    Satyapriya Krishna et al. “Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation”. In: arXiv preprint arXiv:2409.12941(2024)

  54. [54]

    A Survey of Post-Training Scaling in Large Language Models

    Hanyu Lai et al. “A Survey of Post-Training Scaling in Large Language Models”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, pp. 2771– 2791. 20 Kimi Linear: An Expressive, Efficient Attention ArchitectureTECHNICALREPORT

  55. [55]

    Liger: Linearizing Large Language Models to Gated Recurrent Structures

    Disen Lan et al. “Liger: Linearizing Large Language Models to Gated Recurrent Structures”. In:arXiv preprint arXiv:2503.01496(2025)

  56. [56]

    CMMLU: Measuring massive multitask language understanding in Chinese

    Haonan Li et al. “CMMLU: Measuring massive multitask language understanding in Chinese”. In:Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 11260–11285.DOI: 10 . 18653 / v1 / 2024 . findings - acl . 671.UR...

  57. [57]

    Transmamba: Flexibly switching between transformer and mamba

    Yixing Li et al. “Transmamba: Flexibly switching between transformer and mamba”. In:arXiv preprint arXiv:2503.24067(2025)

  58. [58]

    Opher Lieber et al.Jamba: A Hybrid Transformer-Mamba Language Model. 2024. arXiv: 2403 . 19887 [cs.CL]

  59. [59]

    Forgetting transformer: Softmax attention with a forget gate

    Zhixuan Lin et al. “Forgetting transformer: Softmax attention with a forget gate”. In:arXiv preprint arXiv:2503.02130(2025)

  60. [60]

    Longhorn: State space models are amortized online learners

    Bo Liu et al. “Longhorn: State Space Models are Amortized Online Learners”. In:ArXivabs/2407.14207 (2024). URL:https://api.semanticscholar.org/CorpusID:271310065

  61. [61]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    Jiawei Liu et al. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”. In:Thirty-seventh Conference on NeurIPS. 2023.URL: https://openreview. net/forum?id=1qvx610Cu7

  62. [62]

    Repoqa: Evaluating long context code understanding

    Jiawei Liu et al. “Repoqa: Evaluating long context code understanding”. In:arXiv preprint arXiv:2406.06025 (2024)

  63. [63]

    Jingyuan Liu et al.Muon is Scalable for LLM Training. 2025. arXiv: 2502.16982 [cs.LG] .URL: https: //arxiv.org/abs/2502.16982

  64. [64]

    Enzhe Lu et al.MoBA: Mixture of Block Attention for Long-Context LLMs. 2025. arXiv: 2502.13189 [cs.LG]. URL:https://arxiv.org/abs/2502.13189

  65. [65]

    The illusion of state in state-space models

    William Merrill, Jackson Petty, and Ashish Sabharwal. “The illusion of state in state-space models”. In:arXiv preprint arXiv:2404.08819(2024)

  66. [66]

    The Parallelism Tradeoff: Limitations of Log-Precision Transformers

    William Merrill and Ashish Sabharwal. “The Parallelism Tradeoff: Limitations of Log-Precision Transformers”. In:Transactions of the Association for Computational Linguistics11 (2023), pp. 531–545.DOI: 10.1162/ tacl_a_00562.URL:https://aclanthology.org/2023.tacl-1.31/

  67. [67]

    MiniMax et al.MiniMax-01: Scaling Foundation Models with Lightning Attention. 2025. arXiv: 2501.08313 [cs.CL]

  68. [68]

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal.Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. 2024. arXiv:2404.07143 [cs.CL]

  69. [69]

    Tsendsuren Munkhdalai and Adam Trischler.Metalearning with Hebbian Fast Weights. 2018. arXiv: 1807. 05076 [cs.NE].URL:https://arxiv.org/abs/1807.05076

  70. [70]

    Metalearned Neural Memory

    Tsendsuren Munkhdalai et al. “Metalearned Neural Memory”. In:ArXivabs/1907.09720 (2019).URL: https: //api.semanticscholar.org/CorpusID:198179407

  71. [71]

    Training language models to follow instructions with human feedback

    Long Ouyang et al. “Training language models to follow instructions with human feedback”. In:Advances in NeurIPS35 (2022), pp. 27730–27744

  72. [72]

    Bo Peng et al.RWKV-7 "Goose" with Expressive Dynamic State Evolution. 2025. arXiv: 2503.14456 [cs.CL]

  73. [73]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng et al. “Yarn: Efficient context window extension of large language models”. In:arXiv preprint arXiv:2309.00071(2023)

  74. [74]

    Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing

    Piotr Pi˛ ekos, Róbert Csordás, and Jürgen Schmidhuber. “Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing”. In:arXiv preprint arXiv:2505.00315(2025)

  75. [75]

    Reasoning with large language models, a survey

    Aske Plaat et al. “Reasoning with large language models, a survey”. In:CoRR(2024)

  76. [76]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah Smith, and Mike Lewis. “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”. In:Proceedings of ICLR. 2022.URL: https://openreview.net/forum?id= R8sQPpGCv0

  77. [77]

    Puvvada et al.SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

    Krishna C. Puvvada et al.SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling. 2025. arXiv:2504.08719 [cs.CL]

  78. [78]

    Zhen Qin et al.HGRN2: Gated Linear RNNs with State Expansion. 2024. arXiv:2404.07904 [cs.CL]

  79. [79]

    Zhen Qin et al.TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

  80. [80]

    arXiv:2307.14995 [cs.CL]

Showing first 80 references.