Seerattention-r: Sparse attention adaptation for long reasoning.arXiv preprint arXiv:2506.08889

Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, et al · 2025 · arXiv 2506.08889

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 3 method 1

citation-polarity summary

background 2 unclear 1 use method 1

representative citing papers

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

cs.CL · 2025-10-10 · conditional · novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

cs.CL · 2025-12-12 · unverdicted · novelty 6.0

BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

cs.LG · 2026-04-16 · unverdicted · novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

cs.DC · 2026-03-30 · unverdicted · novelty 5.0

Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

citing papers explorer

Showing 8 of 8 citing papers.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning cs.CL · 2025-10-10 · conditional · none · ref 7
DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL · 2026-05-18 · unverdicted · none · ref 60
DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction cs.LG · 2026-05-10 · unverdicted · none · ref 10
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving cs.LG · 2026-04-29 · unverdicted · none · ref 19
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding cs.CL · 2025-12-12 · unverdicted · none · ref 8
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference cs.LG · 2026-05-08 · unverdicted · none · ref 14
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning cs.LG · 2026-04-16 · unverdicted · none · ref 9
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference cs.DC · 2026-03-30 · unverdicted · none · ref 7
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

Seerattention-r: Sparse attention adaptation for long reasoning.arXiv preprint arXiv:2506.08889

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer