hub Mixed citations

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han · 2024 · cs.CL · arXiv 2406.10774

Mixed citation behavior. Most common role is background (67%).

46 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 46 citing papers arXiv PDF

abstract

As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1 method 1

citation-polarity summary

background 4 baseline 1 use method 1

representative citing papers

NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

NLL-guided layer selection identifies 1/4 of layers for full attention in hybrid models, matching periodic 1/2-FA baseline accuracy on LongMemEval with Qwen3-4B while halving the full-attention compute budget.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

cs.CL · 2025-10-10 · conditional · novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

MURMUR: An Efficient Inference System for Long-Form ASR

cs.LG · 2026-05-31 · conditional · novelty 6.0

Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.

WorldKV: Efficient World Memory with World Retrieval and Compression

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Spherical KV combines angle-domain attention using spherical key codes with rate-distortion retention to cut KV cache residency and HBM traffic while keeping a paged, fusion-friendly decode path.

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

cs.DC · 2026-05-12 · unverdicted · novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines with no throughput loss.

Compute Where it Counts: Self Optimizing Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or random allocation.

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at 128K context.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

cs.CR · 2026-04-27 · unverdicted · novelty 6.0

TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level selection.

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

citing papers explorer

Showing 39 of 39 citing papers after filters.

NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation cs.CL · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
NLL-guided layer selection identifies 1/4 of layers for full attention in hybrid models, matching periodic 1/2-FA baseline accuracy on LongMemEval with Qwen3-4B while halving the full-attention compute budget.
Very Efficient Listwise Multimodal Reranking for Long Documents cs.IR · 2026-05-12 · unverdicted · none · ref 61 · internal anchor
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression cs.CL · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators cs.AI · 2025-11-05 · unverdicted · none · ref 21 · internal anchor
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 33 · internal anchor
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
WorldKV: Efficient World Memory with World Retrieval and Compression cs.CV · 2026-05-21 · unverdicted · none · ref 21 · internal anchor
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL · 2026-05-18 · unverdicted · none · ref 44 · internal anchor
DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs cs.LG · 2026-05-18 · unverdicted · none · ref 25 · internal anchor
Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference cs.AR · 2026-05-17 · unverdicted · none · ref 64 · internal anchor
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference cs.LG · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
Spherical KV combines angle-domain attention using spherical key codes with rate-distortion retention to cut KV cache residency and HBM traffic while keeping a paged, fusion-friendly decode path.
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference cs.DC · 2026-05-12 · unverdicted · none · ref 16 · internal anchor
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines with no throughput loss.
Compute Where it Counts: Self Optimizing Language Models cs.LG · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or random allocation.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction cs.LG · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval cs.CV · 2026-05-10 · unverdicted · none · ref 65 · internal anchor
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference cs.CL · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache cs.LG · 2026-05-07 · unverdicted · none · ref 43 · internal anchor
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility cs.AI · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at 128K context.
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding cs.AR · 2026-04-27 · unverdicted · none · ref 48 · internal anchor
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing cs.CR · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression cs.LG · 2026-04-23 · unverdicted · none · ref 15 · internal anchor
Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level selection.
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 32 · internal anchor
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
Graph-Guided Adaptive Channel Elimination for KV Cache Compression eess.SP · 2026-04-18 · unverdicted · none · ref 15 · internal anchor
GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs cs.LG · 2026-04-12 · unverdicted · none · ref 9 · internal anchor
IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation cs.LG · 2026-02-09 · unverdicted · none · ref 15 · internal anchor
CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference cs.CL · 2026-01-20 · unverdicted · none · ref 10 · internal anchor
HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x faster decoding at 224K context length.
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding cs.CL · 2025-12-12 · unverdicted · none · ref 19 · internal anchor
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
CacheClip: Accelerating RAG with Effective KV Cache Reuse cs.LG · 2025-10-11 · unverdicted · none · ref 35 · internal anchor
CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.
OjaKV: Context-Aware Online Low-Rank KV Cache Compression cs.CL · 2025-09-25 · unverdicted · none · ref 15 · internal anchor
OjaKV introduces hybrid full-rank storage for key tokens combined with online low-rank KV cache compression via Oja's algorithm to support memory-efficient long-context LLM inference.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval cs.CV · 2025-05-21 · unverdicted · none · ref 29 · internal anchor
LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
MoBA: Mixture of Block Attention for Long-Context LLMs cs.LG · 2025-02-18 · unverdicted · none · ref 27 · internal anchor
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cs.CL · 2025-02-16 · unverdicted · none · ref 80 · internal anchor
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM cs.CL · 2026-06-28 · unverdicted · none · ref 54 · internal anchor
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.
IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference cs.CL · 2026-05-25 · unverdicted · none · ref 24 · internal anchor
IndexMem proposes a learned KV importance predictor paired with a latent memory module to enable bounded KV cache size for long-context inference, reporting gains on RULER, Needle-in-a-Haystack, and LongBench across multiple LLMs.
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling cs.DC · 2026-05-20 · unverdicted · none · ref 11 · internal anchor
DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference cs.PF · 2025-08-22 · unverdicted · none · ref 57 · internal anchor
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
PiKV: KV Cache Management System for Mixture of Experts cs.DC · 2025-08-02 · unverdicted · none · ref 20 · internal anchor
PiKV proposes expert-sharded KV storage, PiKV routing, adaptive scheduling, and compression modules to reduce overhead in multi-GPU MoE inference.
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing cs.LG · 2025-07-29 · unverdicted · none · ref 24 · internal anchor
ReasonCache reuses similar KV cache states across reasoning steps in LRMs via collaborative filtering to boost serving throughput by up to 89.2% while preserving accuracy.
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM cs.CL · 2025-05-09 · unverdicted · none · ref 17 · internal anchor
STARC remaps sparse KV caches by semantic clustering for PIM hardware, delivering 19-31% lower attention latency and 19-27% lower energy versus token-wise sparsity, with larger gains under tight KV budgets.
Comparative Characterization of KV Cache Management Strategies for LLM Inference cs.AR · 2026-04-06 · unverdicted · none · ref 14 · internal anchor
Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer