Pith Number

pith:NVAFL2CY

pith:2025:NVAFL2CYTUBAUAGXMEJWSSWGB6
not attested · not anchored · not stored · refs resolved

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Arvind Krishnamurthy, Baris Kasikci, Lequn Chen, Luis Ceze, Ruihang Lai, Stephanie Wang, Tianqi Chen, Vinod Grover, Wuwei Lin, Yineng Zhang, Zihao Ye

FlashInfer uses block-sparse KV-cache formats and JIT-compiled attention templates to cut inter-token latency by 29-69% in LLM serving.

arxiv:2501.01005 v2 · 2025-01-02 · cs.DC · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{NVAFL2CYTUBAUAGXMEJWSSWGB6}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction over compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.

C2 · weakest assumption

The reported speedups assume that the block-sparse format and JIT templates integrate cleanly with existing serving frameworks without hidden overheads from compilation or scheduling that would appear under production load patterns not tested in the benchmarks.

C3 · one line summary

FlashInfer delivers a customizable attention kernel that reduces inter-token latency by 29-69% in LLM serving benchmarks via optimized KV-cache storage and load-balanced scheduling compatible with CUDA graphs.

References

14 extracted · 14 resolved · 3 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

32 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:47.743252Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5

Aliases

arxiv: 2501.01005 · arxiv_version: 2501.01005v2 · doi: 10.48550/arxiv.2501.01005 · pith_short_12: NVAFL2CYTUBA · pith_short_16: NVAFL2CYTUBAUAGX · pith_short_8: NVAFL2CY
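The short aliases above are plain prefixes of the 26-character Pith Number. In this record the Pith Number also coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks both relationships. Treat the base32 link as an observation about this one record, not a documented rule of the pith-number/v1.0 schema; the helper names `base32_prefix` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5"
PITH_NUMBER = "NVAFL2CYTUBAUAGXMEJWSSWGB6"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier relates to the hash in this
    record; the actual builder may derive it differently.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number: str, sizes=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in sizes}


derived = base32_prefix(CANONICAL_HASH)
print("derived identifier:", derived)
print("matches Pith Number:", derived == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```

Because the aliases are prefixes, a shorter alias can only be as unique as its length allows; the full 26-character form is the safer key when storing references.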
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/NVAFL2CYTUBAUAGXMEJWSSWGB6 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ef8f3da93594914f409eba5625144ecd53a2459c3616175e48fa0c82a2e4b035",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.DC",
    "submitted_at": "2025-01-02T02:02:20Z",
    "title_canon_sha256": "fec089e8cc02be9f63d92004ce4011f481ba5effd987267046f165313ae1587b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.01005",
    "kind": "arxiv",
    "version": 2
  }
}
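
The shell command under "Verify this Pith Number yourself" can also be reproduced offline from the record printed above. The following is a minimal sketch that applies the same canonicalization as that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to a hand-copied version of the record. The names `canonical_bytes` and `canonical_sha256` are hypothetical helpers, and the comparison only succeeds if the served record is byte-for-byte equivalent to the JSON shown here.

```python
import hashlib
import json

# Hash published on this page under "Canonical hash".
EXPECTED = "6d4055e8589d020a00d76113694ac60fa3a2eb5e7fd631754622ac45fd3f6da5"

# Canonical record copied by hand from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "ef8f3da93594914f409eba5625144ecd53a2459c3616175e48fa0c82a2e4b035",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by-sa/4.0/",
        "primary_cat": "cs.DC",
        "submitted_at": "2025-01-02T02:02:20Z",
        "title_canon_sha256": "fec089e8cc02be9f63d92004ce4011f481ba5effd987267046f165313ae1587b",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.01005", "kind": "arxiv", "version": 2},
}


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace, as UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(obj) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print("computed:", digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the serialization independent of how the JSON was pretty-printed, which is what lets a mirror recompute the same digest from a reformatted copy.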