pith:NVAFL2CY
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
FlashInfer uses block-sparse KV-cache formats and JIT-compiled attention templates to cut inter-token latency by 29-69% in LLM serving.
arxiv:2501.01005 v2 · 2025-01-02 · cs.DC · cs.AI · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{NVAFL2CYTUBAUAGXMEJWSSWGB6}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction over compiler backends on LLM serving benchmarks, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
The reported speedups assume that the block-sparse format and JIT templates integrate cleanly with existing serving frameworks, with no hidden compilation or scheduling overheads that would surface under production load patterns the benchmarks did not test.
FlashInfer delivers a customizable attention kernel that reduces inter-token latency by 29-69% in LLM serving benchmarks via optimized KV-cache storage and load-balanced scheduling compatible with CUDA graphs.
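The block-sparse KV-cache storage named in the claims can be pictured with a small sketch. This is an illustration only, not FlashInfer's API or implementation: the helper names `build_page_table` and `token_slots`, the `PAGE_SIZE` value, and the example page ids are all hypothetical. The sketch assumes a paged cache in which each request owns a list of fixed-size pages, flattened into CSR-style arrays (a row pointer, a page-index array, and the fill level of each request's last page).

```python
from typing import List, Tuple

PAGE_SIZE = 4  # hypothetical number of tokens stored per KV-cache page


def build_page_table(
    pages_per_request: List[List[int]], seq_lens: List[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Flatten per-request page lists into CSR-style arrays."""
    indptr = [0]        # indptr[r]..indptr[r+1] delimits request r's pages
    indices = []        # physical page ids, concatenated over requests
    last_page_len = []  # tokens actually used in each request's last page
    for pages, n in zip(pages_per_request, seq_lens):
        assert n >= 1, "this sketch assumes non-empty sequences"
        assert len(pages) == -(-n // PAGE_SIZE), "pages must exactly cover the sequence"
        indices.extend(pages)
        indptr.append(len(indices))
        last_page_len.append((n - 1) % PAGE_SIZE + 1)
    return indptr, indices, last_page_len


def token_slots(
    indptr: List[int], indices: List[int], last_page_len: List[int], request: int
) -> List[Tuple[int, int]]:
    """Return the (page_id, offset) slot of every cached token of one request."""
    pages = indices[indptr[request]:indptr[request + 1]]
    slots = []
    for i, page in enumerate(pages):
        # Every page is full except possibly the last one.
        used = last_page_len[request] if i == len(pages) - 1 else PAGE_SIZE
        slots.extend((page, off) for off in range(used))
    return slots


# Two requests: 6 tokens stored in pages 7 and 2, and 3 tokens stored in page 5.
indptr, indices, last_page_len = build_page_table([[7, 2], [5]], [6, 3])
print(indptr, indices, last_page_len)
print(token_slots(indptr, indices, last_page_len, 0))
```

Under these assumptions, requests of different lengths share one flat index array, so a kernel can walk each request's non-contiguous pages without padding every sequence to the longest one.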
Receipt and verification
| First computed | 2026-05-17T23:38:47.743252Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/NVAFL2CYTUBAUAGXMEJWSSWGB6 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ef8f3da93594914f409eba5625144ecd53a2459c3616175e48fa0c82a2e4b035",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.DC",
    "submitted_at": "2025-01-02T02:02:20Z",
    "title_canon_sha256": "fec089e8cc02be9f63d92004ce4011f481ba5effd987267046f165313ae1587b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.01005",
    "kind": "arxiv",
    "version": 2
  }
}
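The hashing step of the shell one-liner above can also be run offline against the record as printed on this page. The following is a minimal sketch: it mirrors the serialization settings used in that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding), while the function name `canonical_sha256` and the constant `EXPECTED` are introduced here for illustration. It prints whether the digest equals the canonical hash listed on this page; it does not assume that it will.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "ef8f3da93594914f409eba5625144ecd53a2459c3616175e48fa0c82a2e4b035",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by-sa/4.0/",
        "primary_cat": "cs.DC",
        "submitted_at": "2025-01-02T02:02:20Z",
        "title_canon_sha256": "fec089e8cc02be9f63d92004ce4011f481ba5effd987267046f165313ae1587b",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.01005", "kind": "arxiv", "version": 2},
}

# Canonical hash as listed on this page.
EXPECTED = "6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5"

digest = canonical_sha256(record)
print(digest)
print("matches listed hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the fetched JSON.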