DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang · 2024 · cs.CL · arXiv 2410.10819

Canonical reference. 75% of citing Pith papers cite this work as background.

25 Pith papers citing it

Background 75% of classified citations

abstract

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, referred to as Streaming Heads, primarily focus on recent tokens and attention sinks and do not require full attention. Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads, which reduces both the decoding and pre-filling memory and latency of LLMs without compromising their long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is available at https://github.com/mit-han-lab/duo-attention.
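
The split described in the abstract (a full KV cache for retrieval heads, a constant-length cache of attention sinks plus recent tokens for streaming heads) can be illustrated with a small token-counting sketch. This is not the authors' implementation: the names `DuoCacheConfig`, `streaming_keep_indices`, `cached_tokens`, and `memory_reduction` are hypothetical, the head counts and window sizes are illustrative values rather than figures from the paper, and the sketch counts cached tokens per head only, ignoring head dimension, layers, and quantization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuoCacheConfig:
    """Hypothetical configuration; all values are illustrative assumptions."""
    num_heads: int        # total attention heads considered
    retrieval_heads: int  # heads that keep a full KV cache
    sink_tokens: int      # initial tokens a streaming head always keeps
    recent_tokens: int    # most recent tokens a streaming head keeps


def streaming_keep_indices(seq_len: int, sink_tokens: int, recent_tokens: int) -> list[int]:
    """Positions a streaming head retains: the first sink_tokens and the last recent_tokens."""
    if seq_len <= sink_tokens + recent_tokens:
        # The sequence still fits in the constant-length window, so nothing is evicted.
        return list(range(seq_len))
    return list(range(sink_tokens)) + list(range(seq_len - recent_tokens, seq_len))


def cached_tokens(cfg: DuoCacheConfig, seq_len: int) -> int:
    """Total cached token slots across heads under the retrieval/streaming split."""
    per_streaming_head = len(
        streaming_keep_indices(seq_len, cfg.sink_tokens, cfg.recent_tokens)
    )
    streaming_heads = cfg.num_heads - cfg.retrieval_heads
    return cfg.retrieval_heads * seq_len + streaming_heads * per_streaming_head


def memory_reduction(cfg: DuoCacheConfig, seq_len: int) -> float:
    """Ratio of full-cache token slots to token slots kept under the split."""
    return (cfg.num_heads * seq_len) / cached_tokens(cfg, seq_len)


if __name__ == "__main__":
    cfg = DuoCacheConfig(num_heads=4, retrieval_heads=1, sink_tokens=2, recent_tokens=3)
    print(streaming_keep_indices(10, cfg.sink_tokens, cfg.recent_tokens))
    print(cached_tokens(cfg, 10), memory_reduction(cfg, 10))
```

Because the streaming-head term stays constant as the sequence grows, the reduction in this toy model approaches `num_heads / retrieval_heads` for long contexts, which is why the fraction of heads classified as retrieval heads dominates the achievable saving.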

citation-role summary

background 6 · baseline 1 · other 1

citation-polarity summary

background 6 · baseline 1 · unclear 1

representative citing papers

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

cs.CV · 2026-05-20 · conditional · novelty 7.0

HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

cs.DC · 2026-05-13 · conditional · novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

cs.DC · 2026-05-12 · unverdicted · novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines with no throughput loss.

Compute Where it Counts: Self Optimizing Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or random allocation.

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at 128K context.

Training Transformers for KV Cache Compressibility

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

cs.DC · 2026-04-07 · unverdicted · novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

cs.CL · 2025-12-12 · unverdicted · novelty 6.0

BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

cs.LG · 2025-10-21 · unverdicted · novelty 6.0

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

Accelerating Prefilling via Decoding-time Contribution Sparsity

cs.CL · 2025-07-29 · conditional · novelty 6.0

TriangleMix exploits decoding-time contribution sparsity via a training-free static attention pattern to accelerate LLM prefilling with nearly lossless performance.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

cs.CL · 2025-02-16 · unverdicted · novelty 6.0

NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

cs.CL · 2024-07-16 · accept · novelty 6.0

Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.

TIDE: Every Layer Knows the Token Beneath the Context

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

cs.DC · 2026-04-18 · unverdicted · novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and

The Pitfalls of KV Cache Compression

cs.LG · 2025-09-30 · conditional · novelty 5.0

KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple policy changes can mitigate this.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

cs.PF · 2025-08-22 · unverdicted · novelty 5.0

ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

cs.LG · 2026-02-20 · 2 refs

citing papers explorer

Showing 25 of 25 citing papers.

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
cs.CV · 2026-05-20 · conditional · none · ref 35

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
cs.DC · 2026-05-13 · conditional · none · ref 48

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
cs.DC · 2026-04-08 · unverdicted · none · ref 49

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
cs.CL · 2026-05-18 · unverdicted · none · ref 46

SparseSAM: Structured Sparsification of Activations in Segment Anything Models
cs.CV · 2026-05-17 · unverdicted · none · ref 29

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
cs.AR · 2026-05-17 · unverdicted · none · ref 69

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
cs.DC · 2026-05-12 · unverdicted · none · ref 18

Compute Where it Counts: Self Optimizing Language Models
cs.LG · 2026-05-11 · unverdicted · none · ref 21

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL · 2026-05-08 · unverdicted · none · ref 44

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
cs.LG · 2026-05-07 · unverdicted · none · ref 29

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
cs.AI · 2026-05-07 · unverdicted · none · ref 18

Training Transformers for KV Cache Compressibility
cs.LG · 2026-05-07 · unverdicted · none · ref 54 · 2 links

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC · 2026-04-07 · unverdicted · none · ref 77

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
cs.LG · 2026-02-09 · unverdicted · none · ref 18

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
cs.CL · 2025-12-12 · unverdicted · none · ref 22

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
cs.LG · 2025-10-21 · unverdicted · none · ref 46

Accelerating Prefilling via Decoding-time Contribution Sparsity
cs.CL · 2025-07-29 · conditional · none · ref 13

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
cs.CL · 2025-02-16 · unverdicted · none · ref 49

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
cs.CL · 2024-07-16 · accept · none · ref 41

TIDE: Every Layer Knows the Token Beneath the Context
cs.CL · 2026-05-07 · unverdicted · none · ref 112

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC · 2026-04-18 · unverdicted · none · ref 29

The Pitfalls of KV Cache Compression
cs.LG · 2025-09-30 · conditional · none · ref 14

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
cs.PF · 2025-08-22 · unverdicted · none · ref 63

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG · 2026-02-20 · unreviewed · ref 32 · 2 links

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
cs.CL · 2026-02-01 · unreviewed · ref 33
