FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Pith reviewed 2026-05-16 13:21 UTC · model grok-4.3
The pith
FlashInfer uses block-sparse KV-cache formats and JIT-compiled attention templates to cut inter-token latency by 29-69% in LLM serving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashInfer provides an attention engine that stores KV caches in block-sparse formats to handle heterogeneous sequence lengths efficiently, supplies customizable attention templates through just-in-time (JIT) compilation, and applies a load-balanced scheduling algorithm that remains compatible with CUDAGraph's static execution model. It reports a 29-69% inter-token latency reduction versus compiler backends, a 28-30% latency reduction for long contexts, and a 13-17% speedup under parallel generation.
What carries the argument
Block-sparse KV-cache storage format paired with composable memory layouts, JIT-compiled attention templates, and a load-balanced scheduler that preserves CUDAGraph compatibility.
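The block-sparse idea can be sketched in a few lines: requests of very different lengths map onto fixed-size blocks drawn from a shared pool, indexed CSR-style. This is an illustrative sketch, not FlashInfer's actual API; `BLOCK_SIZE` and the function name are hypothetical.

```python
# Hypothetical sketch of a block-sparse (BSR/CSR-style) KV-cache index,
# showing how heterogeneous sequence lengths map onto fixed-size blocks.
# Illustrative only; not FlashInfer's actual data structures.

BLOCK_SIZE = 16  # tokens per KV block (illustrative choice)

def build_block_sparse_index(seq_lens):
    """Build CSR-style (indptr, indices) arrays: each request owns a
    contiguous range of block IDs in a shared block pool."""
    indptr = [0]
    indices = []
    next_block = 0
    for n in seq_lens:
        n_blocks = -(-n // BLOCK_SIZE)  # ceiling division
        indices.extend(range(next_block, next_block + n_blocks))
        next_block += n_blocks
        indptr.append(len(indices))
    return indptr, indices

# Three requests with very different lengths share one block pool;
# request i's blocks are indices[indptr[i]:indptr[i+1]].
indptr, indices = build_block_sparse_index([5, 40, 17])
```

The point of the layout is that an attention kernel can iterate over exactly the blocks a request owns, so short and long sequences coexist in one batch without padding every request to the maximum length.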
If this is right
- Integrating the engine into existing LLM serving systems reduces inter-token latency by 29-69% relative to current compiler-based attention backends.
- Long-context inference workloads experience 28-30% lower end-to-end latency.
- Parallel generation scenarios see a 13-17% speedup while retaining CUDAGraph compatibility.
- The same kernel set supports multiple serving frameworks without per-framework rewrites.
Where Pith is reading between the lines
- The block-sparse layout may extend naturally to other memory-bound GPU kernels if the format overhead remains low at scale.
- JIT customization opens a route for serving systems to adopt new attention variants without rebuilding the entire inference stack.
- Load balancing tuned for dynamic requests could interact with hardware-specific memory hierarchies in ways that reward further per-GPU tuning.
- Wider adoption would shift attention optimization from per-model hand tuning toward reusable, format-driven engines.
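The load-balancing idea in the third bullet can be made concrete with a small sketch: split each request's KV range into fixed-size chunks and assign chunks greedily to the least-loaded worker, so one very long sequence cannot starve the rest. This is an illustrative LPT-style partitioner under assumed parameters, not FlashInfer's actual scheduling algorithm.

```python
# Illustrative chunked work partitioning in the spirit of a load-balanced
# attention scheduler. Not FlashInfer's algorithm; `chunk` is hypothetical.
import heapq

def balance(seq_lens, n_workers, chunk=512):
    # Break each request's KV length into chunks of at most `chunk` tokens.
    chunks = []
    for rid, n in enumerate(seq_lens):
        while n > 0:
            take = min(chunk, n)
            chunks.append((take, rid))
            n -= take
    # Greedy longest-first assignment to the currently least-loaded worker.
    chunks.sort(reverse=True)
    heap = [(0, w, []) for w in range(n_workers)]
    heapq.heapify(heap)
    for size, rid in chunks:
        load, w, items = heapq.heappop(heap)
        items.append((rid, size))
        heapq.heappush(heap, (load + size, w, items))
    return sorted(heap)  # (total_load, worker_id, assigned_chunks)
```

With one 4096-token request and three 100-token requests on two workers, the long request's chunks spread across both workers and the final loads differ by at most one chunk, which is the behavior a scheduler needs before per-GPU tuning can pay off.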
Load-bearing premise
The reported speedups assume that block-sparse formats and JIT templates integrate into serving frameworks without hidden compilation or scheduling overheads that appear only under untested production request patterns.
What would settle it
A production-scale benchmark on vLLM or SGLang, with highly variable request lengths and high concurrency, that shows no net latency improvement once all JIT compilation and scheduler costs are included.
read the original abstract
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to dynamism of user requests while maintaining compatibility with CUDAGraph which requires static configuration. FlashInfer have been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieve 29-69% inter-token-latency reduction compared to compiler backends for LLM serving benchmark, 28-30% latency reduction for long-context inference, and 13-17% speedup for LLM serving with parallel generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FlashInfer, an attention engine for LLM inference serving that addresses KV-cache heterogeneity via block-sparse and composable formats, provides customizable attention kernels through JIT compilation, and introduces a load-balanced scheduler that remains compatible with CUDAGraph's static requirements. It reports 29-69% inter-token latency reduction versus compiler backends, 28-30% latency reduction for long-context inference, and 13-17% speedup for parallel generation, with integrations into SGLang, vLLM, and MLC-Engine.
Significance. If the measured speedups prove robust, the work offers a practical engineering advance for high-throughput LLM serving by supplying a flexible, high-performance attention backend that can be dropped into existing frameworks. The emphasis on composable formats and JIT templates addresses real heterogeneity in production workloads, and the CUDAGraph compatibility is a notable strength for static-graph serving stacks.
major comments (2)
- [Evaluation] Evaluation section: the reported 29-69% inter-token latency reductions and other headline figures are given without isolating JIT compilation overhead, template-switch costs, or dynamic re-scheduling latency under variable arrival rates and batch sizes; these unmeasured costs could directly erode the claimed gains when the system is placed under production-like request patterns.
- [Scheduling] Scheduling description: the load-balanced scheduler is asserted to maintain CUDAGraph compatibility while handling dynamic user requests, yet no concrete mechanism, pseudocode, or timing breakdown is supplied showing how static graph capture is preserved across frequent template changes or request heterogeneity.
minor comments (1)
- [Abstract] Abstract: grammatical error ('FlashInfer have been integrated' should read 'FlashInfer has been integrated').
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our paper. We address each major comment below and have made revisions to incorporate additional details and measurements as suggested.
read point-by-point responses
- Referee: [Evaluation] Evaluation section: the reported 29-69% inter-token latency reductions and other headline figures are given without isolating JIT compilation overhead, template-switch costs, or dynamic re-scheduling latency under variable arrival rates and batch sizes; these unmeasured costs could directly erode the claimed gains when the system is placed under production-like request patterns.
  Authors: The headline figures are derived from end-to-end benchmarks that incorporate all system components, including JIT compilation and dynamic scheduling under realistic workloads. To address the concern more explicitly, we will revise the evaluation section to include a new subsection with microbenchmarks that isolate the JIT overhead, template-switch costs, and re-scheduling latency across different arrival rates and batch sizes. This will confirm that these costs do not significantly erode the reported gains.
  Revision: yes
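The microbenchmark the rebuttal promises amounts to separating one-time compilation cost from steady-state kernel latency. A minimal sketch of such a harness, with `compile_kernel` and `run_kernel` as stand-ins for the real calls:

```python
# Hedged sketch of a JIT-overhead microbenchmark: time the one-time
# compile separately from amortized steady-state latency. The callables
# are placeholders, not FlashInfer's real compilation or launch API.
import time

def measure_jit_split(compile_kernel, run_kernel, warm_iters=10, iters=100):
    t0 = time.perf_counter()
    kernel = compile_kernel()            # one-time JIT compilation cost
    compile_s = time.perf_counter() - t0

    for _ in range(warm_iters):          # warm caches and clocks
        run_kernel(kernel)

    t0 = time.perf_counter()
    for _ in range(iters):
        run_kernel(kernel)
    steady_s = (time.perf_counter() - t0) / iters
    return compile_s, steady_s
```

If `compile_s` is large but paid once per template while `steady_s` carries the headline gains, the referee's concern reduces to how often production traffic forces recompilation.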
- Referee: [Scheduling] Scheduling description: the load-balanced scheduler is asserted to maintain CUDAGraph compatibility while handling dynamic user requests, yet no concrete mechanism, pseudocode, or timing breakdown is supplied showing how static graph capture is preserved across frequent template changes or request heterogeneity.
  Authors: We agree that more detail is needed. In the revised manuscript, we will expand the scheduling section to include a concrete description of the mechanism, pseudocode for the load-balanced scheduler, and an explanation of how static graph capture is maintained (e.g., by capturing graphs for a set of common templates and using a dynamic dispatcher for heterogeneous requests). We will also add a timing breakdown to quantify the costs of template changes.
  Revision: yes
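The mechanism the rebuttal gestures at can be sketched as a dispatcher over pre-captured static graphs: capture one graph per (template, padded batch size) pair, replay the closest capture for common shapes, and fall back to eager execution otherwise. The capture set and tuple encoding below are hypothetical, not FlashInfer's implementation.

```python
# Illustrative dispatcher over pre-captured static graphs, in the spirit
# of CUDAGraph-compatible dynamic serving. All names are hypothetical.

CAPTURED_BATCH_SIZES = [1, 8, 32, 128]  # assumed capture set

def dispatch(template, batch_size, captured):
    """Pick a pre-captured (template, padded_batch) graph if one exists;
    otherwise signal a fallback to the eager (non-graph) path."""
    for b in CAPTURED_BATCH_SIZES:
        if b >= batch_size and (template, b) in captured:
            return ("graph", template, b)   # replay captured graph, padded up
    return ("eager", template, batch_size)  # uncommon shape: run eagerly

# Example: only the "decode" template has captured graphs.
captured = {("decode", b) for b in CAPTURED_BATCH_SIZES}
```

The design trade-off is padding waste (a batch of 5 replays the size-8 graph) against the graph-launch savings CUDAGraph provides; a timing breakdown would quantify exactly that.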
Circularity Check
No circularity: engineering artifact validated by direct measurements
full rationale
The paper describes an implementation of attention kernels using block-sparse KV-cache formats, composable layouts, JIT-compiled templates, and a load-balanced scheduler compatible with CUDAGraph. All performance claims (29-69% latency reductions, etc.) rest on empirical wall-clock timings from kernel-level and end-to-end benchmarks against existing frameworks. No equations, fitted parameters, or derivations are presented that could reduce to their own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing steps. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 20 Pith papers
- VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
  VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
- CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
  The CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
  Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
- LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
  LLM-Emu is a serving-native emulator for vLLM that replaces GPU execution with profile-driven latency sampling and achieves under 5% error on TPOT, ITL, E2E latency, and throughput across multiple models, GPUs, and workloads.
- Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
  Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
- GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
  GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
- Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering
  Attention sparsity in video DiTs is an input-stable layer-wise property, enabling offline profiling and online bidirectional QK co-clustering for up to 1.93x speedup with PSNR up to 29 dB.
- VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
  VTC eliminates unnecessary data movement in DNN compilation using virtual tensors tracked by index mappings, achieving up to 1.93x speedup and 60% memory savings on NVIDIA GPUs.
- Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
  SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
- Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
  SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
- RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
  RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kern...
- AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
  AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
- Geometric Context Transformer for Streaming 3D Reconstruction
  LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
- PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
  PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.
- HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
  HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x fas...
- BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
  BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
- Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
  FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
- EdgeFM: Efficient Edge Inference for Vision-Language Models
  EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
- UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
  UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
- RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
discussion (0)