pith. machine review for the scientific record.

arxiv: 2501.01005 · v2 · submitted 2025-01-02 · 💻 cs.DC · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:21 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG
keywords LLM inference · attention kernel · KV cache · block-sparse format · JIT compilation · GPU serving · CUDAGraph · inference serving

The pith

FlashInfer uses block-sparse KV-cache formats and JIT-compiled attention templates to cut inter-token latency by 29-69% versus compiler backends in LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlashInfer as a customizable attention engine built for the demands of large language model inference serving. It stores key-value caches in block-sparse and composable formats to reduce memory redundancy and improve access efficiency across heterogeneous request lengths. A just-in-time compilation system generates tailored attention kernels while a load-balanced scheduler adapts to dynamic user traffic without violating the static configuration rules of CUDA graphs. These mechanisms are integrated into existing frameworks such as vLLM and SGLang. A sympathetic reader would care because attention kernels remain the dominant cost in high-throughput serving, and a single engine that stays fast across varying workloads can lower both response times and hardware requirements.
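To make the storage idea concrete, here is a minimal sketch of a CSR-style block-sparse view over a shared KV page pool. The page size, head count, and array names are illustrative assumptions, not FlashInfer's actual layout; the point is that requests of very different lengths reduce to per-request index spans over one pool.

```python
import numpy as np

# A shared pool of KV pages; each request owns an arbitrary subset of pages.
PAGE = 16          # tokens per KV block (hypothetical page size)
HEADS, DIM = 8, 64

num_pages = 32
kv_pool = np.random.randn(num_pages, PAGE, HEADS, DIM).astype(np.float32)

# Three requests with heterogeneous lengths share the same pool.
# CSR convention: request r owns pages indices[indptr[r]:indptr[r+1]].
indptr = np.array([0, 3, 4, 6])          # request -> span in `indices`
indices = np.array([5, 9, 2, 11, 0, 7])  # physical page ids, any order

def gather_kv(r: int) -> np.ndarray:
    """Materialize request r's KV as a dense (len, heads, dim) tensor."""
    pages = indices[indptr[r]:indptr[r + 1]]
    return kv_pool[pages].reshape(-1, HEADS, DIM)

print(gather_kv(0).shape)  # (48, 8, 64): 3 pages x 16 tokens
```

Because each request is just a span of page indices, growing one request's cache touches only its own index list, never the layouts of its neighbors.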

Core claim

FlashInfer provides an attention engine that stores KV caches in block-sparse formats to handle heterogeneous sequence lengths efficiently, supplies customizable attention templates through just-in-time compilation, and applies a load-balanced scheduling algorithm that remains compatible with CUDAGraph static execution. On this basis the paper reports a 29-69% inter-token latency reduction versus compiler backends, a 28-30% latency reduction for long contexts, and a 13-17% speedup under parallel generation.

What carries the argument

Block-sparse KV-cache storage format paired with composable memory layouts, JIT-compiled attention templates, and a load-balanced scheduler that preserves CUDAGraph compatibility.
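A hedged sketch of the JIT side: a kernel template specialized by configuration and memoized, so each (head_dim, dtype, variant) combination pays compilation once. The template string and cache below are invented for illustration; FlashInfer's real codegen emits CUDA/CUTLASS, not Python.

```python
import functools

# Hypothetical kernel "template" parameterized by the attention configuration.
KERNEL_TEMPLATE = """
attention_kernel<head_dim={head_dim}, dtype={dtype}, variant={variant}>
"""

@functools.lru_cache(maxsize=None)
def get_kernel(head_dim: int, dtype: str, variant: str):
    """Specialize and 'compile' a kernel once per configuration."""
    source = KERNEL_TEMPLATE.format(head_dim=head_dim, dtype=dtype,
                                    variant=variant)
    # Stand-in for a real compile step (nvcc/nvrtc); we return a callable
    # tagged with its specialization instead of machine code.
    return lambda q, k, v: f"ran {source.strip()}"

# First call pays compilation; repeats for the same config hit the cache.
k1 = get_kernel(128, "f16", "causal")
k2 = get_kernel(128, "f16", "causal")
assert k1 is k2
```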

If this is right

  • Integrating the engine into existing LLM serving systems reduces inter-token latency by 29-69% relative to current compiler-based attention backends.
  • Long-context inference workloads experience 28-30% lower end-to-end latency.
  • Parallel generation scenarios gain 13-17% throughput improvement while keeping CUDA graph compatibility (a sketch of the scheduling idea follows this list).
  • The same kernel set supports multiple serving frameworks without per-framework rewrites.
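How a scheduler can balance dynamic requests while keeping the static launch configuration CUDA graphs require is worth making concrete. A minimal sketch, assuming a fixed worker count and a hypothetical chunk size; this is one plausible reading of the load-balancing idea, not the paper's algorithm:

```python
# The worker (CTA) count never changes, so a captured graph's grid stays
# static; only the per-worker work lists vary between batches.
NUM_WORKERS = 4     # fixed, so the launch configuration is static
CHUNK = 256         # KV tokens per work item (hypothetical tile size)

def schedule(kv_lens: list[int]) -> list[list[tuple[int, int, int]]]:
    """Return per-worker lists of (request, chunk_start, chunk_len)."""
    plan = [[] for _ in range(NUM_WORKERS)]
    loads = [0] * NUM_WORKERS
    work = []
    for r, n in enumerate(kv_lens):
        for s in range(0, n, CHUNK):
            work.append((r, s, min(CHUNK, n - s)))
    # Longest-processing-time-first greedy balancing.
    for item in sorted(work, key=lambda w: -w[2]):
        w = loads.index(min(loads))
        plan[w].append(item)
        loads[w] += item[2]
    return plan

# Heterogeneous lengths: one long request no longer serializes one worker.
print(schedule([4096, 300, 300, 300]))
```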

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The block-sparse layout may extend naturally to other memory-bound GPU kernels if the format overhead remains low at scale.
  • JIT customization opens a route for serving systems to adopt new attention variants without rebuilding the entire inference stack.
  • Load balancing tuned for dynamic requests could interact with hardware-specific memory hierarchies in ways that reward further per-GPU tuning.
  • Wider adoption would shift attention optimization from per-model hand tuning toward reusable, format-driven engines.

Load-bearing premise

The reported speedups assume that block-sparse formats and JIT templates integrate into serving frameworks without hidden compilation or scheduling overheads that appear only under untested production request patterns.

What would settle it

A production-scale benchmark on vLLM or SGLang, with highly variable request lengths and high concurrency, that charges all JIT compilation and scheduler costs against the measured gains; the claim fails if no net inter-token latency improvement survives that accounting.
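One shape such a measurement could take, sketched with a stand-in workload: time the first call, which carries JIT and graph-capture costs, separately from steady-state per-step latency, so neither can hide inside the other. The `measure` harness and the step function below are hypothetical.

```python
import time

def measure(step, warmup=1, iters=100):
    """Separate cold-start cost (JIT, capture) from steady-state latency."""
    t0 = time.perf_counter()
    for _ in range(warmup):
        step()                      # first call: pays JIT + capture cost
    cold = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(iters):
        step()                      # steady state: cache hits only
    steady = (time.perf_counter() - t0) / iters
    return cold, steady

# Stand-in for one decode step of a served model.
cold, steady = measure(lambda: sum(range(10_000)))
print(f"cold start {cold*1e3:.2f} ms, steady per-step {steady*1e6:.1f} us")
```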

read the original abstract

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to dynamism of user requests while maintaining compatibility with CUDAGraph which requires static configuration. FlashInfer have been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieve 29-69% inter-token-latency reduction compared to compiler backends for LLM serving benchmark, 28-30% latency reduction for long-context inference, and 13-17% speedup for LLM serving with parallel generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents FlashInfer, an attention engine for LLM inference serving that addresses KV-cache heterogeneity via block-sparse and composable formats, provides customizable attention kernels through JIT compilation, and introduces a load-balanced scheduler that remains compatible with CUDAGraph's static requirements. It reports 29-69% inter-token latency reduction versus compiler backends, 28-30% latency reduction for long-context inference, and 13-17% speedup for parallel generation, with integrations into SGLang, vLLM, and MLC-Engine.

Significance. If the measured speedups prove robust, the work offers a practical engineering advance for high-throughput LLM serving by supplying a flexible, high-performance attention backend that can be dropped into existing frameworks. The emphasis on composable formats and JIT templates addresses real heterogeneity in production workloads, and the CUDAGraph compatibility is a notable strength for static-graph serving stacks.

major comments (2)
  1. [Evaluation] Evaluation section: the reported 29-69% inter-token latency reductions and other headline figures are given without isolating JIT compilation overhead, template-switch costs, or dynamic re-scheduling latency under variable arrival rates and batch sizes; these unmeasured costs could directly erode the claimed gains when the system is placed under production-like request patterns.
  2. [Scheduling] Scheduling description: the load-balanced scheduler is asserted to maintain CUDAGraph compatibility while handling dynamic user requests, yet no concrete mechanism, pseudocode, or timing breakdown is supplied showing how static graph capture is preserved across frequent template changes or request heterogeneity.
minor comments (1)
  1. [Abstract] Abstract: grammatical error ('FlashInfer have been integrated' should read 'FlashInfer has been integrated').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address each major comment below and have made revisions to incorporate additional details and measurements as suggested.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 29-69% inter-token latency reductions and other headline figures are given without isolating JIT compilation overhead, template-switch costs, or dynamic re-scheduling latency under variable arrival rates and batch sizes; these unmeasured costs could directly erode the claimed gains when the system is placed under production-like request patterns.

    Authors: The headline figures are derived from end-to-end benchmarks that incorporate all system components, including JIT compilation and dynamic scheduling under realistic workloads. To address the concern more explicitly, we will revise the evaluation section to include a new subsection with microbenchmarks that isolate the JIT overhead, template-switch costs, and re-scheduling latency across different arrival rates and batch sizes. This will confirm that these costs do not significantly erode the reported gains. revision: yes

  2. Referee: [Scheduling] Scheduling description: the load-balanced scheduler is asserted to maintain CUDAGraph compatibility while handling dynamic user requests, yet no concrete mechanism, pseudocode, or timing breakdown is supplied showing how static graph capture is preserved across frequent template changes or request heterogeneity.

    Authors: We agree that more detail is needed. In the revised manuscript, we will expand the scheduling section to include a concrete description of the mechanism, pseudocode for the load-balanced scheduler, and an explanation of how static graph capture is maintained (e.g., by capturing graphs for a set of common templates and using a dynamic dispatcher for heterogeneous requests). We will also add a timing breakdown to quantify the costs of template changes. revision: yes
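The bucketing scheme the rebuttal gestures at can be sketched with PyTorch's real CUDA-graph API (torch.cuda.CUDAGraph); the bucket sizes and the decode_step stand-in are assumptions, not the paper's code.

```python
import torch

# Capture one graph per padded batch-size bucket; dispatch dynamic requests
# to the smallest bucket that fits. Requires a CUDA device.
BUCKETS = [1, 4, 16]                     # hypothetical batch-size buckets
graphs, static_in, static_out = {}, {}, {}

def decode_step(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0 + 1.0                 # stand-in for one attention decode

for b in BUCKETS:
    static_in[b] = torch.zeros(b, 128, device="cuda")
    decode_step(static_in[b])            # warm up before capture
    torch.cuda.synchronize()
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):            # capture with a fixed shape
        static_out[b] = decode_step(static_in[b])
    graphs[b] = g

def run(batch: torch.Tensor) -> torch.Tensor:
    b = next(s for s in BUCKETS if s >= batch.shape[0])  # smallest fit
    static_in[b].zero_()
    static_in[b][: batch.shape[0]].copy_(batch)          # fill static input
    graphs[b].replay()                                   # static launch config
    return static_out[b][: batch.shape[0]]
```

The replayed launch configuration never changes; only the contents of the static input buffers do, which is what reconciles dynamic batches with static graph capture.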

Circularity Check

0 steps flagged

No circularity: engineering artifact validated by direct measurements

full rationale

The paper describes an implementation of attention kernels using block-sparse KV-cache formats, composable layouts, JIT-compiled templates, and a load-balanced scheduler compatible with CUDAGraph. All performance claims (29-69% latency reductions, etc.) rest on empirical wall-clock timings from kernel-level and end-to-end benchmarks against existing frameworks. No equations, fitted parameters, or derivations are presented that could reduce to their own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing steps. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an engineering system rather than a theoretical derivation, so it introduces no new mathematical axioms, free parameters, or invented physical entities.

pith-pipeline@v0.9.0 · 5560 in / 1055 out tokens · 18761 ms · 2026-05-16T13:21:57.766174+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  3. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

    cs.DC 2026-05 unverdicted novelty 7.0

    Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

  4. LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling

    cs.DC 2026-05 accept novelty 7.0

    LLM-Emu is a serving-native emulator for vLLM that replaces GPU execution with profile-driven latency sampling and achieves under 5% error on TPOT, ITL, E2E latency, and throughput across multiple models, GPUs, and workloads.

  5. Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

    cs.PL 2026-04 unverdicted novelty 7.0

    Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.

  6. GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

    cs.DC 2026-03 unverdicted novelty 7.0

    GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

  7. Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering

    cs.CV 2026-03 conditional novelty 7.0

    Attention sparsity in video DiTs is an input-stable layer-wise property, enabling offline profiling and online bidirectional QK co-clustering for up to 1.93x speedup with PSNR up to 29 dB.

  8. VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

    cs.DC 2026-02 unverdicted novelty 7.0

    VTC eliminates unnecessary data movement in DNN compilation using virtual tensors tracked by index mappings, achieving up to 1.93x speedup and 60% memory savings on NVIDIA GPUs.

  9. Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    cs.LG 2026-05 unverdicted novelty 6.0

    SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.

  10. Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    cs.LG 2026-05 unverdicted novelty 6.0

    SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.

  11. RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kern...

  12. AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

  13. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  14. PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

    cs.PF 2026-01 unverdicted novelty 6.0

    PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.

  15. HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

    cs.CL 2026-01 unverdicted novelty 6.0

    HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x fas...

  16. BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    cs.CL 2025-12 unverdicted novelty 6.0

    BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

  17. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  18. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  19. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

  20. RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    cs.LG 2025-05
