{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:R5I3O3BKJRI5W7L5DDVIWJLSCQ","short_pith_number":"pith:R5I3O3BK","schema_version":"1.0","canonical_sha256":"8f51b76c2a4c51db7d7d18ea8b25721415869cc95e7906d90d5fba833ac4d882","source":{"kind":"arxiv","id":"2305.05920","version":3},"attestation_state":"computed","paper":{"title":"Fast Distributed Inference Serving for Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed.","cross_cats":["cs.DC"],"primary_cat":"cs.LG","authors_text":"Bingyang Wu, Fangyue Liu, Gang Huang, Shengyu Liu, Xin Jin, Xuanzhe Liu, Yinmin Zhong, Yuanhang Sun, Zili Zhang","submitted_at":"2023-05-10T06:17:50Z","abstract_excerpt":"Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency.\n  We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2305.05920","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2023-05-10T06:17:50Z","cross_cats_sorted":["cs.DC"],"title_canon_sha256":"11d47b641c181c272ea0ee2eff1e59d151b2e50a675352094028c329f8712803","abstract_canon_sha256":"78cdf56d7fa54d739f556b8432c47f660915962d0d52e492c2ac1e70b807618a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.250276Z","signature_b64":"Wb3Q11wq+G1cSbuW2tZXGn+Y43OnnNqDxorg7DSmWCe9Uapm2FC57gCaMg+7iXqpW54u1MMZwp45VnRtSY3tDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"8f51b76c2a4c51db7d7d18ea8b25721415869cc95e7906d90d5fba833ac4d882","last_reissued_at":"2026-05-17T23:38:14.249636Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.249636Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Fast Distributed Inference Serving for Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed.","cross_cats":["cs.DC"],"primary_cat":"cs.LG","authors_text":"Bingyang Wu, Fangyue Liu, Gang Huang, Shengyu Liu, Xin Jin, Xuanzhe Liu, Yinmin Zhong, Yuanhang Sun, Zili Zhang","submitted_at":"2023-05-10T06:17:50Z","abstract_excerpt":"Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency.\n  We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That token-level preemption and the skip-join MLFQ assignment based on input length incur low enough overhead to deliver the reported gains without hidden costs in real workloads.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"FastServe adds token-level preemption and a skip-join MLFQ scheduler to LLM serving, delivering up to 31.4x higher throughput than vLLM at equivalent average and tail latency.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4feae767686b215ed42f1ea0dfcbef5ca22cb18e9277d8dea078c63a91932402"},"source":{"id":"2305.05920","kind":"arxiv","version":3},"verdict":{"id":"9a642a36-4920-4c52-8bb8-f59482e58f7c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T11:17:21.588710Z","strongest_claim":"experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.","one_line_summary":"FastServe adds token-level preemption and a skip-join MLFQ scheduler to LLM serving, delivering up to 31.4x higher throughput than vLLM at equivalent average and tail latency.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That token-level preemption and the skip-join MLFQ assignment based on input length incur low enough overhead to deliver the reported gains without hidden costs in real workloads.","pith_extraction_headline":"FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed."},"references":{"count":59,"sample":[{"doi":"","year":2022,"title":"Introducing ChatGPT","work_id":"a5efa48c-9007-406c-b6d2-be938cb1c3ff","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"ChatGPT sets record for fastest-growing user base","work_id":"cedfe6dd-223f-4e96-8a8e-f07f315ad0b5","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Reinventing search with a new ai-powered bing and edge, your copilot for the web","work_id":"25bd9eda-e0dd-4674-96c1-16b8c7daabd2","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Our next-generation model: Gemini 1.5","work_id":"9cb11add-3054-4d6d-aa1c-8f99f011a10a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Introducing the next generation of Claude","work_id":"7c7d9cf6-059c-4acc-95c5-cd733b780dee","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":59,"snapshot_sha256":"2f0e5cba6fd1e55d264b3b89675f2fb3aef7066f184db32442194bfa57886bdd","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.05920","created_at":"2026-05-17T23:38:14.249736+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.05920v3","created_at":"2026-05-17T23:38:14.249736+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.05920","created_at":"2026-05-17T23:38:14.249736+00:00"},{"alias_kind":"pith_short_12","alias_value":"R5I3O3BKJRI5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"R5I3O3BKJRI5W7L5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"R5I3O3BK","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":28,"internal_anchor_count":28,"sample":[{"citing_arxiv_id":"2605.23389","citing_title":"AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2505.09999","citing_title":"ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09472","citing_title":"WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2512.19179","citing_title":"CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2601.20309","citing_title":"SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21312","citing_title":"Frontier: Towards Comprehensive and Accurate LLM Inference Simulation","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20863","citing_title":"PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19593","citing_title":"Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19729","citing_title":"Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2511.02230","citing_title":"Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2504.15965","citing_title":"From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs","ref_index":112,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09427","citing_title":"ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2602.09725","citing_title":"Efficient Remote KV Cache Reuse with GPU-native Video Codec","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2404.14294","citing_title":"A Survey on Efficient Inference for Large Language Models","ref_index":280,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00831","citing_title":"GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27960","citing_title":"Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27476","citing_title":"EdgeFM: Efficient Edge Inference for Vision-Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08581","citing_title":"PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09104","citing_title":"Token Economics for LLM Agents: A Dual-View Study from Computing and Economics","ref_index":178,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09735","citing_title":"KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08639","citing_title":"ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06113","citing_title":"Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22906","citing_title":"Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities","ref_index":166,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04595","citing_title":"A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07144","citing_title":"Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics","ref_index":53,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ","json":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ.json","graph_json":"https://pith.science/api/pith-number/R5I3O3BKJRI5W7L5DDVIWJLSCQ/graph.json","events_json":"https://pith.science/api/pith-number/R5I3O3BKJRI5W7L5DDVIWJLSCQ/events.json","paper":"https://pith.science/paper/R5I3O3BK"},"agent_actions":{"view_html":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ","download_json":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ.json","view_paper":"https://pith.science/paper/R5I3O3BK","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.05920&json=true","fetch_graph":"https://pith.science/api/pith-number/R5I3O3BKJRI5W7L5DDVIWJLSCQ/graph.json","fetch_events":"https://pith.science/api/pith-number/R5I3O3BKJRI5W7L5DDVIWJLSCQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ/action/storage_attestation","attest_author":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ/action/author_attestation","sign_citation":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ/action/citation_signature","submit_replication":"https://pith.science/pith/R5I3O3BKJRI5W7L5DDVIWJLSCQ/action/replication_record"}},"created_at":"2026-05-17T23:38:14.249736+00:00","updated_at":"2026-05-17T23:38:14.249736+00:00"}