MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Fu, T · 2024 · arXiv 2406.14909

5 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

cs.LG · 2025-12-14 · unverdicted · novelty 7.0

Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

MoBA: Mixture of Block Attention for Long-Context LLMs

cs.LG · 2025-02-18 · unverdicted · novelty 6.0

MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

cs.CL · 2025-02-16 · unverdicted · novelty 6.0

NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

cs.LG · 2026-02-20 · 2 refs

citing papers explorer


Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics cs.LG · 2025-12-14 · unverdicted · none · ref 9
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 25
MoBA: Mixture of Block Attention for Long-Context LLMs cs.LG · 2025-02-18 · unverdicted · none · ref 24
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cs.CL · 2025-02-16 · unverdicted · none · ref 54
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference cs.LG · 2026-02-20 · unreviewed · ref 12 · 2 links
