Flashinfer: Efficient and customizable attention engine for llm inference serving.Proceedings of Machine Learning and Systems, 7

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al · 2025

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

cs.DC · 2026-05-12 · unverdicted · novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines with no throughput loss.

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

cs.CL · 2026-05-16 · unverdicted · novelty 5.0

CompactAttention accelerates chunked-prefill attention via Block-Union KV Selection, delivering up to 2.72x speedup at 128K context on LLaMA-3.1-8B while matching dense accuracy on RULER.

citing papers explorer

Showing 3 of 3 citing papers.

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation cs.CV · 2026-05-13 · unverdicted · none · ref 25
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference cs.DC · 2026-05-12 · unverdicted · none · ref 15
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines with no throughput loss.
CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection cs.CL · 2026-05-16 · unverdicted · none · ref 24
CompactAttention accelerates chunked-prefill attention via Block-Union KV Selection, delivering up to 2.72x speedup at 128K context on LLaMA-3.1-8B while matching dense accuracy on RULER.

Flashinfer: Efficient and customizable attention engine for llm inference serving.Proceedings of Machine Learning and Systems, 7

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer