{"paper":{"title":"Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"Diffusion LLMs can reach up to 27 times higher throughput by adding a reusable block-wise KV cache and decoding only high-confidence tokens in parallel.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Chengyue Wu, Enze Xie, Hao Zhang, Ligeng Zhu, Ping Luo, Shizhe Diao, Shuchen Xue, Song Han, Zhijian Liu","submitted_at":"2025-05-28T17:39:15Z","abstract_excerpt":"Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. 
Additionally, we identify the root cause of gen"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the block-wise approximate KV cache introduces only negligible performance drop and that a single confidence threshold can be chosen to preserve generation quality across benchmarks without post-hoc per-task retuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Diffusion LLMs can reach up to 27 times higher throughput by adding a reusable block-wise KV cache and decoding only high-confidence tokens in parallel.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f24406c2f002b1a64e35976b50236fbcc4633fac5ca074731963f35337f28e30"},"source":{"id":"2505.22618","kind":"arxiv","version":3},"verdict":{"id":"e45f5b97-466a-4ff5-bba7-06dba8cc791c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T04:22:39.565036Z","strongest_claim":"Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models.","one_line_summary":"Fast-dLLM adds reusable KV cache blocks and selective parallel 
decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the block-wise approximate KV cache introduces only negligible performance drop and that a single confidence threshold can be chosen to preserve generation quality across benchmarks without post-hoc per-task retuning.","pith_extraction_headline":"Diffusion LLMs can reach up to 27 times higher throughput by adding a reusable block-wise KV cache and decoding only high-confidence tokens in parallel."},"references":{"count":44,"sample":[{"doi":"","year":2025,"title":"Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov","work_id":"a7d122b4-332a-4b4e-b20f-261ba8497fa0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Structured denoising diffusion models in discrete state-spaces","work_id":"73976884-b26f-4171-a159-b0f0a8b9ce9e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"A continuous time framework for discrete denoising models","work_id":"f67f008a-e2de-44f8-a5a0-6dd58dc2befd","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Fast sampling via de-randomization for discrete diffusion models","work_id":"19c1e430-1409-412d-b13e-f87d1e5714a7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Discrete flow 
matching","work_id":"d3810acb-2472-492b-b3b2-2cefdf738e9c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":44,"snapshot_sha256":"01b65f53997ee7bc50ebae18ef6441d398d39a42222c7c90bd024da7edd75c84","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d77a8a9c1f6ad21895b1a43885db0c963b200bcfefb761bd05082ee3cdeef337"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}