EntmaxKV: Support-Aware Decoding for Entmax Attention

Gonçalo Duarte; Marcos V. Treviso; Miguel Couceiro

arxiv: 2605.21649 · v1 · pith:NEVHBA3Onew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3

keywords: entmax attention · sparse decoding · KV cache optimization · long context inference · support recovery · transformer efficiency · attention sparsity

The pith

Sparse decoding for entmax attention becomes exact when the selected KV pages include the full support set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that α-entmax attention, unlike softmax, produces exact zeros, so sparse KV decoding can be exact rather than an approximation. This matters for long-context generation because it reduces the need to load the entire growing KV cache for each new token. The approach combines query-aware page scoring with a selector that adapts the number of pages loaded based on entmax thresholds estimated from page statistics. If the support is captured, the output matches the full computation exactly; otherwise the truncation error is controlled by the dropped mass δ, which vanishes once the support is recovered. Experiments show it closely matches full-cache entmax with a small fraction of the KV cache and delivers up to 5.43× speedup at 1M-token context.

Core claim

EntmaxKV is an entmax-native sparse decoding framework that exploits the exact zeros of α-entmax to perform support recovery. If the selected candidates contain the entmax support, sparse decoding remains exact. The truncation error is controlled by the dropped probability mass δ and vanishes when the support is recovered. A Gaussian-aware entmax selector estimates the entmax threshold from lightweight page statistics to adapt the selected budget to the score distribution.
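The exactness claim can be checked on a toy example. The sketch below is not the authors' kernel: it computes α-entmax by plain bisection on the threshold τ (an assumption made for brevity), using made-up scores, and compares the full result with the result computed only on a candidate set that contains the support. The function name `entmax` and the example values are illustrative.

```python
from typing import List, Sequence


def entmax(scores: Sequence[float], alpha: float = 1.5, iters: int = 60) -> List[float]:
    """alpha-entmax probabilities, found by bisection on the threshold tau."""
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1")
    z = [(alpha - 1.0) * s for s in scores]
    power = 1.0 / (alpha - 1.0)

    def mass(tau: float) -> float:
        return sum(max(zi - tau, 0.0) ** power for zi in z)

    # mass(max(z) - 1) >= 1 and mass(max(z)) == 0, so the root lies in between.
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(zi - lo, 0.0) ** power for zi in z]
    total = sum(p)  # equals 1 up to bisection tolerance
    return [pi / total for pi in p]


scores = [2.0, 1.8, 0.1, -1.0, 1.9, -0.5]  # made-up attention scores
p_full = entmax(scores)
support = [i for i, pi in enumerate(p_full) if pi > 0.0]

candidates = [0, 1, 2, 4]  # hypothetical selection: a superset of the support
p_cand = entmax([scores[i] for i in candidates])
p_sparse = [0.0] * len(scores)
for i, pi in zip(candidates, p_cand):
    p_sparse[i] = pi

print("support:", support)
print("max abs difference:", max(abs(a - b) for a, b in zip(p_full, p_sparse)))
```

Tokens outside the support receive a weight of exactly 0.0, so leaving them out of the candidate set changes neither τ nor any remaining weight, up to the bisection tolerance.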

What carries the argument

Support-aware candidate selection that identifies KV pages likely to hold the entmax support using Gaussian statistics on page scores.

If this is right

If the selected candidates contain the entmax support, sparse decoding remains exact.
Output error is controlled by the dropped probability mass δ and vanishes when the support is recovered.
EntmaxKV drops less probability mass and retains more support tokens than softmax-based sparse decoding at matched KV budgets.
It closely matches full-cache entmax while using a small fraction of the KV cache on long-context benchmarks.
It achieves up to 3.36× (softmax) and 5.43× (entmax) speedup over full-attention baselines at 1M context length.
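To make the dropped-mass comparison in the list above concrete, here is a toy sketch that applies top-k truncation to the same made-up scores under softmax and under 1.5-entmax. It is not a reproduction of the paper's experiments: the paper compares separately trained models, whereas this example reuses one score vector for both, and all names (`entmax15`, `top_k`, `dropped_mass`) are illustrative.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [x / total for x in e]


def entmax15(scores: Sequence[float], iters: int = 60) -> List[float]:
    """1.5-entmax by bisection on tau: p_i = max(s_i / 2 - tau, 0) ** 2."""
    z = [0.5 * s for s in scores]
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(max(zi - mid, 0.0) ** 2 for zi in z) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(zi - lo, 0.0) ** 2 for zi in z]
    total = sum(p)
    return [pi / total for pi in p]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def dropped_mass(probs: Sequence[float], kept: Sequence[int]) -> float:
    """delta: full-attention probability assigned to tokens that were not kept."""
    kept_set = set(kept)
    return sum(p for i, p in enumerate(probs) if i not in kept_set)


scores = [2.0, 1.8, 0.1, -1.0, 1.9, -0.5]  # made-up attention scores
for k in (2, 3, 4):
    kept = top_k(scores, k)
    print(k, dropped_mass(softmax(scores), kept), dropped_mass(entmax15(scores), kept))
```

For softmax every truncation leaves some mass behind, because every token has nonzero weight. For entmax, δ becomes exactly zero as soon as the kept set covers the support, which in this toy example happens at k = 3.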

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same support-recovery idea could be applied to other attention variants that admit exact sparsity.
Adaptive budget selection based on score distributions might reduce cache requirements across different model architectures.
Exact support recovery could make it easier to combine sparse decoding with other efficiency techniques such as caching or pruning.

Load-bearing premise

The Gaussian-aware entmax selector can adapt the selected budget so that the true support is recovered with high probability at modest cache budgets.
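This page does not describe the selector in enough detail to reproduce it. As a hedged illustration of how a threshold can be estimated from a mean and a variance alone, the sketch below assumes α = 1.5, treats the n scores as i.i.d. draws from N(μ, σ²), and solves n·E[(s/2 − τ)₊²] = 1 for τ using the closed-form Gaussian partial moment. Pooling page statistics into a single (μ, σ), the function names, and the bisection bounds are all assumptions of this sketch, not the authors' implementation.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its cdf and pdf


def expected_mass(tau: float, mu: float, sigma: float) -> float:
    """E[(s/2 - tau)_+^2] for s ~ N(mu, sigma^2): expected 1.5-entmax mass of one token."""
    m, v = 0.5 * mu, 0.5 * sigma  # mean and std of z = (alpha - 1) * s with alpha = 1.5
    d = m - tau
    r = d / v
    # Closed form of the second partial moment of a Gaussian, clamped against round-off.
    return max((d * d + v * v) * _STD.cdf(r) + d * v * _STD.pdf(r), 0.0)


def gaussian_threshold(n: int, mu: float, sigma: float, iters: int = 80) -> float:
    """Solve n * expected_mass(tau) = 1 for tau by bisection."""
    if n < 1 or sigma <= 0.0:
        raise ValueError("need n >= 1 and sigma > 0")
    m, v = 0.5 * mu, 0.5 * sigma
    # At tau = m - 1 the expected mass is at least 1 (Jensen); far in the tail it is ~0.
    lo, hi = m - 1.0, m + 40.0 * v
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if n * expected_mass(mid, mu, sigma) >= 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_support_size(n: int, mu: float, sigma: float, tau: float) -> float:
    """Expected number of tokens with s/2 > tau under the same Gaussian model."""
    m, v = 0.5 * mu, 0.5 * sigma
    return n * _STD.cdf((m - tau) / v)


n, mu, sigma = 4096, 0.0, 1.0  # hypothetical pooled statistics
tau_hat = gaussian_threshold(n, mu, sigma)
print("estimated tau:", tau_hat)
print("expected support size:", expected_support_size(n, mu, sigma, tau_hat))
```

Under this model the expected support size gives a data-dependent budget. How well that budget covers the true support depends on how close the real score distribution is to Gaussian, which is exactly the premise stated above.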

What would settle it

A comparison of EntmaxKV outputs against full entmax attention on the same queries, checking whether differences exceed the measured dropped mass δ when the selector is applied at the reported budgets.
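A minimal version of such a check can be sketched on toy data. One bound that follows directly from the entmax definition: restricting to a candidate set can only lower τ, so every kept weight grows, the total growth equals δ, and the L1 distance between full and sparse weights is 2δ. The output error in max norm is therefore at most 2δ·max|v|. The paper's own bound may use a different norm or constant; the scores, values, and function names here are made up for illustration.

```python
from typing import List, Sequence, Tuple


def entmax15(scores: Sequence[float], iters: int = 60) -> List[float]:
    """1.5-entmax by bisection on tau: p_i = max(s_i / 2 - tau, 0) ** 2."""
    z = [0.5 * s for s in scores]
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(max(zi - mid, 0.0) ** 2 for zi in z) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(zi - lo, 0.0) ** 2 for zi in z]
    total = sum(p)
    return [pi / total for pi in p]


def attend(probs: Sequence[float], values: Sequence[Sequence[float]]) -> List[float]:
    """Attention output: probability-weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(p * v[j] for p, v in zip(probs, values)) for j in range(dim)]


def compare(scores, values, candidates) -> Tuple[float, float, float]:
    """Return (delta, L1 distance between weights, max-abs output error)."""
    p_full = entmax15(scores)
    p_cand = entmax15([scores[i] for i in candidates])
    p_sparse = [0.0] * len(scores)
    for i, pi in zip(candidates, p_cand):
        p_sparse[i] = pi
    kept = set(candidates)
    delta = sum(p for i, p in enumerate(p_full) if i not in kept)
    l1 = sum(abs(a - b) for a, b in zip(p_full, p_sparse))
    o_full, o_sparse = attend(p_full, values), attend(p_sparse, values)
    err = max(abs(a - b) for a, b in zip(o_full, o_sparse))
    return delta, l1, err


scores = [2.0, 1.8, 0.1, -1.0, 1.9, -0.5]  # made-up scores; support is {0, 1, 4}
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0], [-1.0, 0.5], [2.0, -2.0]]
v_max = max(abs(x) for v in values for x in v)

for cand in ([0, 1, 2], [0, 1, 4]):  # first misses token 4, second covers the support
    delta, l1, err = compare(scores, values, cand)
    print(cand, "delta:", delta, "error:", err, "bound:", 2.0 * delta * v_max)
```

An observed error above the bound on real queries would indicate a problem in the selection or kernel rather than in the δ analysis itself.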

Figures

Figures reproduced from arXiv: 2605.21649 by Gonçalo Duarte, Marcos V. Treviso, Miguel Couceiro.

**Figure 2.** Attention approximation quality in terms of the dropped probability mass.

**Figure 4.** Perplexity numbers on PG19. In contrast to softmax sparse decoding, entmax sparse decoding is close to full entmax across budgets. We evaluate language-modeling perplexity on PG19 [17] using two 1B-parameter models with 32k context windows: one trained with softmax attention and one trained with entmax attention. We compare full-cache decoding against sparse decoding with token budgets of 1024, 2048, 4096…

**Figure 5.** Wall-clock time speedup relative to Softmax (FlashDecoding) at different KV-cache lengths.

**Figure 6.** Visualization of softmax and α-entmax for different values of α, alongside top-k variants with k = 2. Each panel shows how $p_0$ varies for the input $z = [0, z_1, z_2]$.

Appendix A, The α-entmax Transformation. For α > 1, the α-entmax transformation [16] maps scores $s \in \mathbb{R}^n$ to a probability distribution $p \in \triangle_n$ of the form $p_i = [(\alpha - 1)s_i - \tau]_+^{1/(\alpha-1)}$ (23), where $\tau \in \mathbb{R}$ is chosen so that $\sum_{i=1}^{n} p_i = 1$ (24). The notation …

Original abstract

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $\alpha$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $\delta$, showing that output error is controlled by $\delta$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
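The abstract does not spell out the page-scoring rule. One common query-aware choice in paged sparse decoding, used by QUEST (Tang et al., 2024), keeps the per-channel minimum and maximum key of each page and uses them to upper-bound the query-key score of any token on that page. The sketch below shows that bound; whether EntmaxKV uses this exact rule is an assumption, and the function names and example data are made up.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def summarize_pages(keys: Sequence[Vector], page_size: int) -> List[Tuple[List[float], List[float]]]:
    """Per-page, per-channel (min, max) of the keys: the only data read at scoring time."""
    pages = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mins = [min(col) for col in zip(*page)]
        maxs = [max(col) for col in zip(*page)]
        pages.append((mins, maxs))
    return pages


def page_upper_bound(query: Vector, mins: Vector, maxs: Vector) -> float:
    """Upper bound on q . k for every key k on the page, without loading the keys."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


keys = [[0.5, 1.0], [1.5, -1.0], [-1.0, 0.2], [0.3, 0.4]]  # made-up keys, dim 2
query = [1.0, -2.0]
page_size = 2

summaries = summarize_pages(keys, page_size)
bounds = [page_upper_bound(query, mins, maxs) for mins, maxs in summaries]
print("page upper bounds:", bounds)
```

Because the bound never underestimates the best score on a page, a page whose bound falls below the entmax threshold cannot contain a support token under this rule, which is what makes such bounds a natural input to support-aware selection.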

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EntmaxKV gives a support-recovery route to exact sparse decoding for entmax attention that sidesteps the mass-loss problem of softmax methods, though the Gaussian selector's accuracy on real score distributions remains the main practical question.

The useful move here is treating entmax decoding as support recovery rather than probability truncation. Because entmax already sets many entries to exact zero, selecting a candidate set that contains the true support makes the sparse version identical to the full one. The paper builds this with query-aware page scoring, then a candidate selector that fits a Gaussian to per-page statistics to pick a budget likely to cover the support, plus the usual sparse entmax kernel at the end.

The truncation analysis is straightforward: output difference is controlled by the dropped mass δ, and δ vanishes on full support recovery. That framing is cleaner than the usual softmax sparse-decoding approximations. Empirically they report lower δ, better support retention, and speedups of up to 3.36× (softmax) and 5.43× (entmax) over full attention at 1M context while staying close to the dense entmax baseline. Those numbers, if they hold under full ablations, would matter for long-context serving.

The soft spot is the Gaussian threshold estimator. Attention logits per page are often skewed or heavy-tailed, so mean-variance fits can underestimate the threshold needed to capture the support. When that happens the selected budget under-covers and exactness is lost even if average δ looks small. The paper does not give non-asymptotic recovery bounds, so the guarantee is only as good as the normality assumption. Without seeing the full experimental splits and selector ablations it is hard to judge how often this actually occurs in practice.

This work is for groups already running entmax or other sparse attention variants and looking for inference-time cache savings. It has enough new pieces and a coherent error story to deserve referee time, even if the selector needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces EntmaxKV, a sparse decoding framework tailored to α-entmax attention for long-context inference. Unlike softmax-based sparse methods that must approximate dense tails, EntmaxKV exploits entmax's exact zeros to reduce sparse decoding to support recovery: if the selected KV pages contain the entmax support, decoding is exact. The framework combines query-aware page scoring, a Gaussian-aware entmax selector that estimates the threshold from lightweight page statistics to adapt the cache budget, and sparse entmax attention. Truncation error is analyzed via the dropped probability mass δ, with the claim that output error is bounded by δ and vanishes upon support recovery. Empirically, EntmaxKV retains more support tokens and achieves lower error than softmax sparse baselines at matched KV budgets, closely matching full-cache entmax while delivering up to 5.43× speedup over full attention at 1M context length.

Significance. If the Gaussian-aware selector recovers the true support with high probability at modest budgets, the work supplies a principled, entmax-native approach to exact sparse attention that directly addresses the KV-cache bottleneck. The δ-based truncation analysis is a clear strength, as is the empirical demonstration that EntmaxKV matches full entmax performance with a small cache fraction. The public code release further supports reproducibility.

major comments (2)

[Gaussian-aware entmax selector and support-recovery analysis] The exactness guarantee (sparse decoding remains exact when the selected candidates contain the entmax support) is load-bearing for the central claim. The Gaussian-aware selector estimates the entmax threshold by fitting mean/variance to page statistics under a normality assumption and solving for the target sparsity. No non-asymptotic bound is provided on the probability of support recovery when per-page score distributions exhibit heavier tails or skewness (common in attention logits). If the fitted threshold is systematically too low, the selected budget under-covers the support, δ does not vanish, and the exactness claim fails even when average δ appears small.
[Truncation error analysis] While the logical relation between output error and dropped mass δ is sound, the manuscript reports only average δ and average support retention. A worst-case or per-sequence analysis (or the variance across long contexts) is needed to confirm that support misses do not occur systematically at the modest budgets where speedups are claimed.
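The gap between average and per-sequence reporting can be illustrated with a small sketch. The δ values below are invented purely to show the effect and are not results from the paper; `summarize_dropped_mass` is a hypothetical helper name.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_dropped_mass(deltas: Sequence[float], tol: float = 0.0) -> Dict[str, float]:
    """Mean, worst case, and fraction of sequences whose dropped mass exceeds tol."""
    if not deltas:
        raise ValueError("need at least one per-sequence delta")
    misses = sum(1 for d in deltas if d > tol)
    return {
        "mean": mean(deltas),
        "max": max(deltas),
        "miss_rate": misses / len(deltas),
    }


# Invented per-sequence dropped masses: four exact recoveries and one clear miss.
deltas = [0.0, 0.0, 0.0, 0.0, 0.2]
report = summarize_dropped_mass(deltas)
print(report)
```

Here the mean looks small while one sequence in five loses a substantial share of its attention mass, which is the pattern the worst-case and miss-rate columns are meant to expose.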

minor comments (2)

The abstract states speedups of 3.36× (softmax) and 5.43× (entmax) over full attention baselines; clarifying whether the entmax speedup is measured against full-cache entmax or against a different baseline would avoid ambiguity.
[Experiments] Experimental details on data splits, number of random seeds, and ablations isolating the contribution of the Gaussian selector versus simpler fixed-budget selection would help readers assess robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Gaussian-aware entmax selector and support-recovery analysis] The exactness guarantee (sparse decoding remains exact when the selected candidates contain the entmax support) is load-bearing for the central claim. The Gaussian-aware selector estimates the entmax threshold by fitting mean/variance to page statistics under a normality assumption and solving for the target sparsity. No non-asymptotic bound is provided on the probability of support recovery when per-page score distributions exhibit heavier tails or skewness (common in attention logits). If the fitted threshold is systematically too low, the selected budget under-covers the support, δ does not vanish, and the exactness claim fails even when average δ appears small.

Authors: We agree that the Gaussian assumption underlies the selector and that the absence of non-asymptotic recovery bounds under heavier tails or skewness is a limitation of the current analysis. The manuscript introduces the selector as a lightweight, practical estimator that adapts the budget from page statistics, with empirical results demonstrating high support retention and low error on the reported benchmarks. In revision we will add an explicit discussion of the normality assumption's limitations together with new experiments on synthetic score distributions exhibiting skewness and heavy tails to quantify robustness. Deriving general non-asymptotic bounds remains an open theoretical question beyond the scope of this work.
Revision: partial

Referee: [Truncation error analysis] While the logical relation between output error and dropped mass δ is sound, the manuscript reports only average δ and average support retention. A worst-case or per-sequence analysis (or the variance across long contexts) is needed to confirm that support misses do not occur systematically at the modest budgets where speedups are claimed.

Authors: We concur that aggregate averages alone leave open the possibility of systematic per-sequence misses. The manuscript establishes that output error is controlled by δ and vanishes upon support recovery, supported by average metrics across long-context and language-modeling tasks. In the revised version we will include per-sequence statistics, variance of δ and support retention, and selected worst-case examples from the 1M-context benchmarks to demonstrate that support recovery holds reliably at the budgets used for the reported speedups.
Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation rests on prior entmax properties and independent new components

The paper's central claims derive from the known support-sparsity property of α-entmax (prior literature) and introduce independent mechanisms: query-aware page scoring, support-aware selection, Gaussian-aware threshold estimation from page statistics, and truncation-error analysis via dropped mass δ. None of these reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The error bound (output error controlled by δ, vanishes on support recovery) follows directly from entmax definition without internal fitting. The selector is presented as a practical estimator, not a tautological result. This is self-contained against external benchmarks and matches the default non-circular outcome.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mathematical property that entmax produces exact zeros and on the practical assumption that a lightweight Gaussian model can locate the support threshold; no new physical entities are postulated and the only free parameters appear to be the selection budget and the entmax alpha inherited from prior work.

free parameters (2)

selection budget
The number of KV pages or tokens retained is adapted by the selector and directly controls both speed and the probability of recovering the full support.
entmax alpha
The sparsity parameter alpha is a hyperparameter carried over from earlier entmax papers and controls the size of the support.

axioms (1)

domain assumption: α-entmax attention produces exact zero probabilities outside a finite support.
This property, stated in the abstract, is what converts sparse decoding from approximation to exact support recovery.

pith-pipeline@v0.9.0 · 5853 in / 1464 out tokens · 50844 ms · 2026-05-22T08:58:01.402244+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

if the selected candidates contain the entmax support, sparse decoding remains exact... dropped probability mass δ... Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

Relation between the paper passage and the cited Recognition theorem.

$\alpha\text{-entmax}(s)_i = [(\alpha - 1)s_i - \tau]_+^{1/(\alpha-1)}$ ... support $S = \{i : (\alpha - 1)s_i > \tau\}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

[1] M. Blondel, A. Martins, and V. Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 606–615. PMLR, 16...

[2] Z. Chen, R. Sadhukhan, Z. Ye, Y. Zhou, J. Zhang, N. Nolte, Y. Tian, M. Douze, L. Bottou, Z. Jia, et al. MagicPIG: LSH sampling for efficient LLM generation. arXiv preprint arXiv:2410.16179, 2024.

[3] G. M. Correia, V. Niculae, and A. F. T. Martins. Adaptively sparse transformers. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174–2184, Hong Kong, China, Nov. 2019. As... doi: 10.18653/v1/d19-1223.

[4] T. Dao, D. Haziza, F. Massa, and G. Sizov. Flash-decoding for long-context inference. CRFM Blog, 10 2023. URL https://crfm.stanford.edu/2023/10/12/flashdecoding.html. Accessed: 2025-12-01.

[5] A. Devoto, Y. Zhao, S. Scardapane, and P. Minervini. A simple and effective $L_2$ norm-based strategy for KV cache compression. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18476–18499, Miami, Florida, USA, Nov. 2024. Association for Computational Lingui... doi: 10.18653/v1/2024.emnlp-main.1027.

[6] N. Gonçalves, M. V. Treviso, and A. Martins. AdaSplash: Adaptive sparse flash attention. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=OWIPDWhUcO.

[7] N. Gonçalves, H. Pitorro, V. Niculae, E. Ponti, L. Li, A. Martins, and M. Treviso. AdaSplash-2: Faster differentiable sparse attention. arXiv preprint arXiv:2604.15180, 2026.

[8] C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What's the real context size of your long-context language models?, 2024. URL https://arxiv.org/abs/2404.06654.

[9] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[10] Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.

[11] C. Lin, J. Tang, S. Yang, H. Wang, T. Tang, B. Tian, I. Stoica, S. Han, and M. Gao. Twilight: Adaptive attention sparsity with hierarchical top-$p$ pruning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Ve693NkzcU.

[12] A. Martins and R. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In M. F. Balcan and K. Q. Weinberger, editors, International Conference on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pages 1614–1623, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.m...

[13] A. Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, checked on, 4(7):2025, 2025.

[14] P. Nawrot, A. Łańcucki, M. Chochowski, D. Tarjan, and E. Ponti. Dynamic memory compression: Retrofitting LLMs for accelerated inference. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=tDRYrAkOB7.

[15] M. Oren, M. Hassid, N. Yarden, Y. Adi, and R. Schwartz. Transformers are multi-state RNNs. In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main. URL https://aclanthology.org/2024.emnlp-main.1043/.

[16] B. Peters, V. Niculae, and A. F. T. Martins. Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1504–1519, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1146. URL https://www.aclweb.org/anthology/P19-1146.

[17] J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.

[18] P. Singhania, S. Singh, S. He, S. Feizi, and A. Bhatele. Loki: Low-rank keys for efficient sparse attention. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 16692–16723. Curran Associates, Inc., 2024. doi: 10.52202/079017-0532. URL https:...

[19] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=KzACYw0MTV.

[20] T. K. Team. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.

[21] P. Vasylenko, M. Treviso, and A. F. Martins. Long-context generalization with sparse attention. arXiv preprint arXiv:2506.16640, 2025.

[22] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.

[23] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Re, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, URL https://openreview.net/forum?id=RkRrPp7GKO.

[24] K. Zhu, T. Tang, Q. Xu, Y. Gu, Z. Zeng, R. Kadekodi, L. Zhao, A. Li, A. Krishnamurthy, and B. Kasikci. Tactic: Adaptive sparse attention with clustering and distribution fitting for long-context LLMs. arXiv preprint arXiv:2502.12216, 2025.
