BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Bo Li; Cameron Shinn; Dominic Brown; George Klimiashvili; Guangxuan Xiao; Huizi Mao; Jiayi Yuan; Jingze Cui; John D. Owens; Julien Demouth

arxiv: 2512.12087 · v3 · submitted 2025-12-12 · 💻 cs.CL


Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Markus Hoehnerbach, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, Huizi Mao

Pith reviewed 2026-05-16 22:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords dynamic sparse attention · LLM inference · softmax thresholding · attention sparsity · prefill speedup · decode optimization · blocked attention · no training required


The pith

BLASST skips attention blocks with a single fixed threshold on softmax statistics to accelerate LLM inference without training or accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BLASST is a dynamic sparse attention method that applies one fixed scalar threshold to online softmax statistics to identify and skip negligible blocks. By bypassing softmax calculations, value loads, and matrix multiplications for those blocks, it reduces computation while reusing statistics already present in standard attention. The approach requires no model retraining, no precomputation passes, and works as a drop-in replacement across MHA, GQA, MQA, and MLA variants. Automated calibration reveals that the optimal threshold scales inversely with context length, so only one threshold each for prefill and decode is needed per model. This delivers 1.52x prefill speedup at 71.9% sparsity and 1.48x decode speedup at 73.2% sparsity on modern GPUs while preserving benchmark accuracy.

Core claim

BLASST achieves dynamic blocked attention sparsity by applying a fixed threshold to reused online softmax statistics, allowing the system to skip softmax, value block loads, and subsequent matrix multiplications for negligible attention blocks. The method supports all major attention variants, needs no training or precomputation, and uses an automated calibration procedure that identifies a simple inverse relationship between optimal threshold and context length. This calibration allows a single threshold for prefill and a single threshold for decode per model. Optimized kernels implement the skipping with negligible overhead, resulting in 1.52x speedup for prefill at 71.9% sparsity and 1.48

What carries the argument

BLASST's reuse of online softmax statistics with a fixed scalar threshold to detect and skip negligible attention blocks.
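The mechanism can be illustrated with a minimal NumPy sketch for a single query row. This is not the paper's Algorithm 1 or its GPU kernel: the function names, the `1/sqrt(d)` score scaling, and the choice to update the running max before applying the skip test are assumptions made for illustration, following the order of steps given in the Figure 1 caption.

```python
import numpy as np

def dense_attention(q, K, V):
    """Reference softmax attention for one query vector (no skipping)."""
    s = K @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p @ V) / p.sum()

def blocked_attention_with_skip(q, K, V, block_size, lam):
    """Online-softmax attention over key/value blocks with a threshold skip test.

    A block is skipped when its max score is lower than the running row max
    by more than ln(lam). lam = 0 disables skipping. Hypothetical helper,
    not the authors' implementation.
    """
    log_lam = np.log(lam) if lam > 0 else -np.inf
    m = -np.inf                       # running row max
    l = 0.0                           # running softmax denominator
    acc = np.zeros(V.shape[1])        # running weighted sum of value rows
    skipped = 0
    for start in range(0, K.shape[0], block_size):
        s = K[start:start + block_size] @ q / np.sqrt(q.shape[0])
        m_block = s.max()             # block max
        m_new = max(m, m_block)       # updated running max
        if m_block - m_new < log_lam:
            # Negligible block: no exponentials, no value load, no second matmul.
            # The running max is unchanged here because m_block < m_new == m.
            skipped += 1
            continue
        scale = np.exp(m - m_new)     # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block_size]
        m = m_new
    return acc / l, skipped
```

In this sketch the skip decision needs only the block max and the running max, both of which an online-softmax loop already computes, which is the sense in which the statistics are reused.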

If this is right

Accelerates both prefill and decode phases on modern GPUs at over 70% sparsity levels.
Applies without modification to MHA, GQA, MQA, and MLA attention variants.
Requires only one threshold per phase per model because of the inverse relationship with context length.
Integrates into existing frameworks via optimized kernels that add negligible latency.
Preserves benchmark accuracy at the reported sparsity levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-threshold simplicity could reduce engineering effort needed to deploy sparse attention in production serving systems.
Lower memory bandwidth from skipped blocks may enable longer contexts on the same hardware without increasing batch size.
The observed inverse threshold-context relationship could be tested as a general calibration heuristic for other dynamic sparsity techniques.
Combining the block-skipping logic with quantization or other kernel fusions might compound the speedups beyond the reported 1.5x.

Load-bearing premise

A single fixed threshold per model chosen via automated calibration maintains accuracy across varying context lengths, tasks, and model architectures without post-hoc adjustments or significant degradation.
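One way to read the premise is that calibration yields a single constant per phase and the per-request threshold is that constant divided by the context length. The sketch below assumes exactly that form, `lam(L) = c / L`; the function names, the averaging estimator, and the example numbers are hypothetical and are not taken from the paper's calibration procedure.

```python
def fit_inverse_constant(calibration_points):
    """Estimate c in lam(L) = c / L from (context_length, best_threshold) pairs.

    Uses the mean of L * lam as a simple estimator. This is an assumed
    procedure for illustration, not the paper's automated calibration.
    """
    products = [length * lam for length, lam in calibration_points]
    return sum(products) / len(products)

def threshold_for_length(context_length, c):
    """Threshold implied by the assumed inverse relationship."""
    return c / context_length

# Made-up calibration points, chosen only to show the shape of the rule.
c = fit_inverse_constant([(8192, 0.02), (16384, 0.01)])
lam_32k = threshold_for_length(32768, c)
```

Under this reading, the premise fails if the product `L * lam` drifts noticeably across context lengths, tasks, or models, because no single `c` would then fit.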

What would settle it

Significant accuracy drop on standard benchmarks when the calibrated threshold is applied to a substantially different context length, task, or model architecture than those used during calibration.

Figures

Figures reproduced from arXiv: 2512.12087 by Bo Li, Cameron Shinn, Dominic Brown, George Klimiashvili, Guangxuan Xiao, Huizi Mao, Jiayi Yuan, Jingze Cui, John D. Owens, Julien Demouth, Kai Xu, Markus Hoehnerbach, Pengbo Wang, Perkz Zheng, Richard Cai, Song Han, Tian Zheng, Timmy Liu, Weijie You, Xia Hu, Yuxin Zhou, Zhouhai Ye.

**Figure 1.** Overview of BLASST. Blocks along a row of the attention matrix are processed sequentially. We (1) update the running row max $m^{(j)}$ as in FlashAttention, (2) compute the block max $\tilde{m}^{(j)}$ for each block $S_j = QK_j^\top$, and (3) skip subsequent work if the block max is lower than the running max by more than the input threshold, $\ln(\lambda)$. Full details can be found in Algorithm 1.

**Figure 2.** (Left) Relative accuracy drop across different datasets and context lengths shows consistent degradation patterns. All curves are normalized to their initial accuracy. (Right) Relationship between threshold and achieved sparsity levels across different sequence lengths, demonstrating the need for threshold calibration to maintain fixed sparsity across varying contexts. Through empirical analysis, we find t…

**Figure 3.** Prefill pipeline schedules for FlashAttention and BLASST at 50% sparsity across 4 loop iterations (L0-L3). Rows are separated based on warp/warpgroup specializations. Darker and lighter hues correspond to ops for different tile rows (T0/T1). The MMA warp’s BMM1 and BMM2 ops are indicated with B1 and B2. The softmax warpgroups are primarily bottlenecked by exponentiation (EX2), but they also perform the ski…

**Figure 4.** Decode pipeline schedules for FlashAttention and BLASST skipping loops 1, 2, and 4. The prologue is not shown, and we focus on the steady state of the first 6 loop iterations (L0-L5). We split out the TMA warp’s pipeline stages to show how multiple TMA loads are issued at once. Loads in Figure 4b finish more quickly because there are fewer simultaneous loads. Arrows indicate scoreboard dependencies from t…

**Figure 5.** Speedup of BLASST prefill on Hopper GPU (H200). Section 5.3, GPU Kernel Performance: We implement and benchmark highly optimized kernels for both Blackwell (B200) and Hopper (H200) GPU architectures, demonstrating that BLASST achieves substantial real-world speedups.

**Figure 7.** Sparsity distribution across layers and heads for Llama8B on 8K context. Taken from NIAH benchmark sample with threshold λ = 0.03. Substantial head-level and layer-level variance motivates adaptive thresholding strategies. … 50%-75%, sparse-trained models achieve substantially better accuracy than applying sparsity post-training, reducing accuracy degradation by up to 1.7×. These results confirm that mode…

**Figure 8.** Figure 8 compares standard sequential processing against reordered processing on VT and FWE tasks. The results show dataset-dependent behavior: reordering yields similar performance on VT but provides noticeable improvements on FWE. This suggests that the effectiveness of reordering largely depends on the specific attention patterns of each dataset. Nevertheless, this demonstrates a valuable property of BLASST: the…

**Figure 9.** Accuracy-sparsity trade-off at high sparsity levels on RULER-16K for Qwen3-8B. BLASST shows more stable degradation compared to XAttention, maintaining better accuracy at aggressive sparsity settings. This shows the effectiveness of using actual softmax statistics versus proxy-based importance scores.

Original abstract

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
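The reported figures can be put in context with a short back-of-the-envelope calculation. It assumes a simple Amdahl-style model, `speedup = 1 / (1 - f * sparsity)`, where `f` is the fraction of baseline runtime that shrinks in proportion to skipped blocks. The model and the function names are assumptions for illustration and do not come from the paper.

```python
def ideal_speedup(sparsity):
    """Upper bound if all runtime scaled with the fraction of blocks kept."""
    return 1.0 / (1.0 - sparsity)

def implied_skippable_fraction(speedup, sparsity):
    """Solve speedup = 1 / (1 - f * sparsity) for f (assumed Amdahl-style model)."""
    return (1.0 - 1.0 / speedup) / sparsity

prefill_f = implied_skippable_fraction(1.52, 0.719)
decode_f = implied_skippable_fraction(1.48, 0.732)
prefill_bound = ideal_speedup(0.719)
```

Under this model the implied fraction is roughly 0.48 for prefill and 0.44 for decode, so about half of the baseline runtime behaves as if it scaled with skipped blocks. The rest would correspond to work that is not skipped, such as the first matmul needed to compute the block max.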

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BLASST gives a practical fixed-threshold way to skip attention blocks using online softmax stats, delivering real GPU speedups without training, but the single-threshold accuracy claim needs tighter testing on varied lengths and tasks.


The main takeaway is a drop-in sparse attention method that skips a whole block when the block's max score falls below the running row max by more than a fixed scalar threshold. It reuses the online softmax statistics already being computed, so skipping avoids the softmax, the value-block load, and the final matmul at almost no extra cost. This runs on both prefill and decode, works across MHA/GQA/MQA/MLA, and they provide optimized kernels that add negligible latency. They calibrate one threshold per model (separate for prefill and decode) via an automated procedure that shows a simple inverse link to context length, then report 1.52x prefill and 1.48x decode speedups at roughly 72% sparsity while keeping benchmark accuracy.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BLASST, a training-free dynamic sparse attention mechanism for LLMs that uses a single fixed scalar threshold (calibrated automatically per model for prefill and decode) to identify and skip negligible attention blocks. By reusing online softmax statistics, the method skips softmax computation, value block loads, and matrix multiplications, supporting MHA/GQA/MQA/MLA variants and delivering optimized GPU kernels. The central empirical claims are 1.52x prefill speedup at 71.9% sparsity and 1.48x decode speedup at 73.2% sparsity while preserving benchmark accuracy, together with an observed inverse relationship between optimal threshold and context length.

Significance. If the single-threshold calibration generalizes, BLASST would offer a low-overhead, hardware-friendly sparse attention primitive that removes training and pre-computation barriers present in prior work, directly addressing memory and compute bottlenecks in long-context inference. The reported speedups at high sparsity levels, combined with drop-in compatibility, would constitute a practically significant contribution to efficient LLM deployment.

major comments (2)

[Abstract] The headline claim that a single fixed threshold per model (chosen via automated calibration) preserves accuracy across contexts, tasks, and architectures rests on an empirical inverse relationship between threshold and context length, yet no formal approximation-error bound or analysis of tail behavior in attention-score distributions is provided; without this, silent degradation on longer sequences or out-of-calibration tasks cannot be ruled out.
[Abstract] The reported speedups and sparsity figures are given without accompanying details on the benchmark suite, number of runs, statistical significance tests, or exact exclusion criteria for block skipping, making it impossible to verify that accuracy preservation is robust rather than benchmark-specific.

minor comments (2)

Clarify the precise interaction of the threshold with online softmax renormalization when blocks are skipped, including any edge cases for very short sequences or GQA/MQA head grouping.
Provide pseudocode or a small worked example illustrating how the reused online statistics translate into the decision to skip a block.
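
A minimal worked inequality, offered as an editorial sketch rather than a result from the paper, shows what the skip test in the Figure 1 caption guarantees per entry. It assumes the running max $m^{(j)}$ is updated with block $j$ before the test and that $0 < \lambda \le 1$. The symbols $M$, $N_{\text{skip}}$, and $\ell_{\text{kept}}$ are introduced here for illustration.

```latex
\begin{align*}
&\text{Skip block } j \text{ when } \tilde{m}^{(j)} - m^{(j)} < \ln\lambda,
  \qquad m^{(j)} \le M := \max_k s_k. \\
&\text{For every score } s \text{ in a skipped block: }
  e^{\,s - M} \;\le\; e^{\,\tilde{m}^{(j)} - m^{(j)}} \;<\; \lambda. \\
&\text{The block containing } M \text{ has } \tilde{m}^{(j)} = m^{(j)},
  \text{ so it is kept and }
  \ell_{\text{kept}} := \sum_{k\ \text{kept}} e^{\,s_k - M} \;\ge\; 1. \\
&\text{Dropped softmax mass: }
  \frac{\sum_{k\ \text{skipped}} e^{\,s_k - M}}
       {\ell_{\text{kept}} + \sum_{k\ \text{skipped}} e^{\,s_k - M}}
  \;\le\; \lambda\, N_{\text{skip}}.
\end{align*}
```

The bound is per row and grows with the number of skipped entries $N_{\text{skip}}$. It says nothing about downstream task accuracy, which is what the major comments ask the authors to establish.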

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing BLASST. We address each of the major comments point by point below, providing clarifications and indicating revisions where the manuscript will be updated.


Referee: [Abstract] The headline claim that a single fixed threshold per model (chosen via automated calibration) preserves accuracy across contexts, tasks, and architectures rests on an empirical inverse relationship between threshold and context length, yet no formal approximation-error bound or analysis of tail behavior in attention-score distributions is provided; without this, silent degradation on longer sequences or out-of-calibration tasks cannot be ruled out.

Authors: We agree that providing a formal approximation-error bound would strengthen the theoretical foundation. Our method relies on empirical calibration showing an inverse relationship between the optimal threshold and context length, which we have validated across a range of lengths and tasks. We will revise the manuscript to include a more detailed analysis of the attention score distributions' tail behavior and discuss the implications for longer sequences. A rigorous formal bound, however, would require additional theoretical development that is outside the current scope focused on practical implementation. revision: partial
Referee: [Abstract] The reported speedups and sparsity figures are given without accompanying details on the benchmark suite, number of runs, statistical significance tests, or exact exclusion criteria for block skipping, making it impossible to verify that accuracy preservation is robust rather than benchmark-specific.

Authors: We appreciate this observation and will update the abstract to provide more context. The experiments were conducted on standard benchmarks including LongBench and other common evaluation suites for long-context LLMs. Results are averaged over multiple runs with different random seeds to ensure statistical reliability, and we skip a block when its maximum attention score falls below the running row maximum by more than the threshold. We will also include references to the statistical tests performed, which showed no significant accuracy loss within the reported margins. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic skipping logic independent of calibrated threshold


The paper defines BLASST via an explicit algorithmic procedure that reuses online softmax statistics to apply a fixed scalar threshold for block skipping. The threshold itself is obtained from an automated calibration pass on data, but neither the sparsity level nor the reported speedups are derived from that scalar by construction; they are measured directly on optimized GPU kernels. No equations reduce the core claim to the calibration input, no self-citations or uniqueness theorems are invoked as load-bearing premises, and the method does not rename or smuggle prior results. The derivation chain is therefore self-contained as an engineering optimization with empirical validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that attention blocks below a calibrated threshold can be dropped with negligible effect on output quality; the threshold itself is a free parameter fitted per model.

free parameters (1)

threshold
Single scalar per model for prefill and decode, calibrated automatically and observed to follow an inverse relationship with context length.

axioms (1)

Domain assumption: negligible attention scores identified via online softmax statistics can be safely skipped without retraining or accuracy loss.
Invoked to justify block skipping in both prefill and decode phases.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Runtime-Certified Bounded-Error Quantized Attention
cs.LG 2026-05 unverdicted novelty 6.0

A tiered KV cache architecture computes per-head per-step error bounds on quantized attention and uses adaptive fallback to guarantee bounded or exact outputs relative to FP16 reference.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
cs.CL 2026-05 unverdicted novelty 6.0

RTPurbo exploits intrinsic sparsity in full-attention LLMs to achieve near-lossless sparse inference after only a few hundred training steps via retrieval-head identification and a lightweight token indexer.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
cs.LG 2026-04 unverdicted novelty 5.0

VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 4 Pith papers · 19 internal anchors

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204.

[3] Behnam, P., Fu, Y., Zhao, R., Tsai, P.-A., Yu, Z., and Tumanov, A. RocketKV: Accelerating long-context LLM inference via two-stage KV cache compression. arXiv preprint arXiv:2502.14051.

[4] Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

[5] Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

[6] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[7] Gao, T., Wettig, A., Yen, H., and Chen, D. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660.

[8] Gao, Y., Guo, S., Cao, S., Xia, Y., Cheng, Y., Wang, L., Ma, L., Sun, Y., Ye, T., Dong, L., et al. SeerAttention-R: Sparse attention adaptation for long reasoning. arXiv preprint arXiv:2506.08889.

[9] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[10] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[11] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.

[12] Lai, X., Lu, J., Luo, Y., Ma, Y., and Zhou, X. FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766.

[13] Łańcucki, A., Staniszewski, K., Nawrot, P., and Ponti, E. M. Inference-time hyper-scaling with KV cache compression. arXiv preprint arXiv:2506.05345.

[14] Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.
Liu, J., Tian, J. L., Daita, V., Wei, Y., Ding, Y., Wang, Y. K., Yang, J., and Zhang, L. RepoQA: Evaluating long context code ...

[15] Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708.

[16] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

[17] Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419.

[18] Sun, Y., Ye, T., Dong, L., Xia, Y., Chen, J., Gao, Y., Cao, S., Wang, J., and Wei, F. Rectified sparse attention. arXiv preprint arXiv:2506.04108, 2025.

[19] Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774.

[20] Wu, W., Wang, Y., Xiao, G., Peng, H., and Fu, Y. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.
[21] Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638–119661, 2024a.
Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention...

[22] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024b.
Xu, R., Xiao, G., Huang, H., Guo, J., and Han, S. XAttention: Block sparse attention with antidiagonal scoring. arXiv preprint arXiv:2503.16428.

[23] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[24] Yang, L., Zhang, Z., Chen, Z., Li, Z., and Jia, Z. TidalDecode: Fast and accurate LLM decoding with position persistent sparse attention. arXiv preprint arXiv:2410.05076.

[25] Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005.

[26] Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y., Wang, L., Xiao, Z., et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089.

[27] Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471.

[28] Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. SpargeAttn: Accurate sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025.

[29] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Bai, Y ., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y ., et al. Long- bench v2: Towards deeper understanding and reason- ing on realistic long-context multitasks.arXiv preprint arXiv:2412.15204,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Rocketkv: Accelerating long-context llm inference via two-stage kv cache compression.arXiv preprint arXiv:2502.14051,

Behnam, P., Fu, Y ., Zhao, R., Tsai, P.-A., Yu, Z., and Tu- manov, A. Rocketkv: Accelerating long-context llm inference via two-stage kv cache compression.arXiv preprint arXiv:2502.14051,

work page arXiv

[4] [4]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[5] [5]

Generating Long Sequences with Sparse Transformers

Child, R., Gray, S., Radford, A., and Sutskever, I. Gen- erating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[6] [6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

How to train long- context language models (effectively)

Gao, T., Wettig, A., Yen, H., and Chen, D. How to train long- context language models (effectively).arXiv preprint arXiv:2410.02660,

work page arXiv

[8] [8]

Seerattention-r: Sparse attention adaptation for long reasoning

Gao, Y ., Guo, S., Cao, S., Xia, Y ., Cheng, Y ., Wang, L., Ma, L., Sun, Y ., Ye, T., Dong, L., et al. Seerattention-r: Sparse attention adaptation for long reasoning.arXiv preprint arXiv:2506.08889,

work page arXiv

[9] [9]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[11]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654,

[12]

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Lai, X., Lu, J., Luo, Y., Ma, Y., and Zhou, X. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766,

[13]

Łańcucki, A., Staniszewski, K., Nawrot, P., and Ponti, E. M. Inference-time hyper-scaling with kv cache compression. arXiv preprint arXiv:2506.05345,

[14]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

Liu, J., Tian, J. L., Daita, V., Wei, Y., Ding, Y., Wang, Y. K., Yang, J., and Zhang, L. Repoqa: Evaluating long context code ...

[15]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708,

[16]

Code Llama: Open Foundation Models for Code

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,

[17]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419,

[18]

Rectified sparse attention

Sun, Y., Ye, T., Dong, L., Xia, Y., Chen, J., Gao, Y., Cao, S., Wang, J., and Wei, F. Rectified sparse attention. arXiv preprint arXiv:2506.04108, 2025.

[19]

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774,

[20]

Retrieval head mechanistically explains long-context factuality

Wu, W., Wang, Y., Xiao, G., Peng, H., and Fu, Y. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574,

[21]

Efficient Streaming Language Models with Attention Sinks

Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638–119661, 2024a.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention...

[22]

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024b.

Xu, R., Xiao, G., Huang, H., Guo, J., and Han, S. Xattention: Block sparse attention with antidiagonal scoring. arXiv preprint arXiv:2503.16428,

[23]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[24]

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Yang, L., Zhang, Z., Chen, Z., Li, Z., and Jia, Z. Tidaldecode: Fast and accurate llm decoding with position persistent sparse attention. arXiv preprint arXiv:2410.05076,

[25]

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., et al. Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005,

[26]

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y., Wang, L., Xiao, Z., et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089,

[27]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471,

[28]

Spargeattn: Accurate sparse attention accelerating any model inference

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Accurate sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025.

[29]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.