{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:LK3AAFPBRF6FDZR2UNQ677ROMZ","short_pith_number":"pith:LK3AAFPB","schema_version":"1.0","canonical_sha256":"5ab60015e1897c51e63aa361effe2e666a536eca4b29df28eb889ec3d70dd7a7","source":{"kind":"arxiv","id":"2308.16369","version":1},"attestation_state":"computed","paper":{"title":"SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SARATHI splits each prefill into equal chunks and fills the rest of every batch with decode requests so the chunks saturate GPU compute while decodes piggyback at far lower cost.","cross_cats":["cs.DC"],"primary_cat":"cs.LG","authors_text":"Amey Agrawal, Ashish Panwar, Bhargav S. Gulavani, Jayashree Mohan, Nipun Kwatra, Ramachandran Ramjee","submitted_at":"2023-08-31T00:03:02Z","abstract_excerpt":"Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles.\n  We present SARATHI to address these challenges. SARATHI employs chunke"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2308.16369","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2023-08-31T00:03:02Z","cross_cats_sorted":["cs.DC"],"title_canon_sha256":"52013df046821f6fa51ac4289806767fcc03790d923841a9e0b1f85213776b67","abstract_canon_sha256":"466bf7c6ea41511e785a758ea569c17066f4dacb24704504de9573d6d0ed8b1e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.830683Z","signature_b64":"GtdnUEEidREP3/lpqNFBYXy1hcX5WJydt9VeMIXDW28iT4pgSRzbJUcAkrLr8SB9w+31VMfUKMBItm4/mG5JAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5ab60015e1897c51e63aa361effe2e666a536eca4b29df28eb889ec3d70dd7a7","last_reissued_at":"2026-05-17T23:38:48.830035Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.830035Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SARATHI splits each prefill into equal chunks and fills the rest of every batch with decode requests so the chunks saturate GPU compute while decodes piggyback at far lower cost.","cross_cats":["cs.DC"],"primary_cat":"cs.LG","authors_text":"Amey Agrawal, Ashish Panwar, Bhargav S. Gulavani, Jayashree Mohan, Nipun Kwatra, Ramachandran Ramjee","submitted_at":"2023-08-31T00:03:02Z","abstract_excerpt":"Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles.\n  We present SARATHI to address these challenges. SARATHI employs chunke"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That chunked prefills can be performed without accuracy loss or extra memory overhead and that decode requests can be freely mixed into the same batch as a prefill chunk while preserving correct autoregressive generation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SARATHI uses chunked prefills and decode-maximal batching to let decode steps ride along with prefill compute, delivering up to 10x higher decode throughput and 1.91x end-to-end throughput on models including LLaMA-13B and GPT-3.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SARATHI splits each prefill into equal chunks and fills the rest of every batch with decode requests so the chunks saturate GPU compute while decodes piggyback at far lower cost.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4295394de828f488a885f09e57676e686064d461654e320359bc358952e4f7d3"},"source":{"id":"2308.16369","kind":"arxiv","version":1},"verdict":{"id":"43b09fc4-e42c-44c2-948b-45d39053332d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T06:27:19.762524Z","strongest_claim":"For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.","one_line_summary":"SARATHI uses chunked prefills and decode-maximal batching to let decode steps ride along with prefill compute, delivering up to 10x higher decode throughput and 1.91x end-to-end throughput on models including LLaMA-13B and GPT-3.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That chunked prefills can be performed without accuracy loss or extra memory overhead and that decode requests can be freely mixed into the same batch as a prefill chunk while preserving correct autoregressive generation.","pith_extraction_headline":"SARATHI splits each prefill into equal chunks and fills the rest of every batch with decode requests so the chunks saturate GPU compute while decodes piggyback at far lower cost."},"references":{"count":48,"sample":[{"doi":"","year":null,"title":"https://aws.amazon.com/ codewhisperer/","work_id":"8ceb9062-1081-4032-823d-82de237e4f51","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"https://claude.ai","work_id":"f42d5c73-87c4-4047-8114-d692921e1e62","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"https://www.bing.com/chat","work_id":"9b6158d3-54e7-4e29-bd55-9fd55925290c","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"https://character.ai","work_id":"53e8cd23-a2da-4851-ba9c-d27d179df274","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"https://chat.openai.com","work_id":"1d52047a-4bbb-4d45-8130-15e3ce4a1d05","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":48,"snapshot_sha256":"31aa9d4c1bfd7cb41b173b3bfe228e77ee0333d8349cb9ea21f33e46477b5305","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c51d576addadf81f81c2c088dc4ed515d4dfdb28872d193ba34273b4bf8ca988"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2308.16369","created_at":"2026-05-17T23:38:48.830136+00:00"},{"alias_kind":"arxiv_version","alias_value":"2308.16369v1","created_at":"2026-05-17T23:38:48.830136+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2308.16369","created_at":"2026-05-17T23:38:48.830136+00:00"},{"alias_kind":"pith_short_12","alias_value":"LK3AAFPBRF6F","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"LK3AAFPBRF6FDZR2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"LK3AAFPB","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2605.23389","citing_title":"AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2412.03594","citing_title":"BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2504.11320","citing_title":"Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2603.10726","citing_title":"PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21312","citing_title":"Frontier: Towards Comprehensive and Accurate LLM Inference Simulation","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19049","citing_title":"KVBuffer: IO-aware Serving for Linear Attention","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18535","citing_title":"Beyond Scaling: Agents Are Heading to the Edge","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19593","citing_title":"Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19775","citing_title":"Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16839","citing_title":"CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02960","citing_title":"MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2410.10819","citing_title":"DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2511.03092","citing_title":"SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09427","citing_title":"ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2404.14294","citing_title":"A Survey on Efficient Inference for Large Language Models","ref_index":282,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00831","citing_title":"GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08151","citing_title":"SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11744","citing_title":"Training-Inference Consistent Segmented Execution for Long-Context LLMs","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11999","citing_title":"The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27476","citing_title":"EdgeFM: Efficient Edge Inference for Vision-Language Models","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08151","citing_title":"SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26103","citing_title":"AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24820","citing_title":"Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11001","citing_title":"Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2409.19256","citing_title":"HybridFlow: A Flexible and Efficient RLHF Framework","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ","json":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ.json","graph_json":"https://pith.science/api/pith-number/LK3AAFPBRF6FDZR2UNQ677ROMZ/graph.json","events_json":"https://pith.science/api/pith-number/LK3AAFPBRF6FDZR2UNQ677ROMZ/events.json","paper":"https://pith.science/paper/LK3AAFPB"},"agent_actions":{"view_html":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ","download_json":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ.json","view_paper":"https://pith.science/paper/LK3AAFPB","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2308.16369&json=true","fetch_graph":"https://pith.science/api/pith-number/LK3AAFPBRF6FDZR2UNQ677ROMZ/graph.json","fetch_events":"https://pith.science/api/pith-number/LK3AAFPBRF6FDZR2UNQ677ROMZ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ/action/storage_attestation","attest_author":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ/action/author_attestation","sign_citation":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ/action/citation_signature","submit_replication":"https://pith.science/pith/LK3AAFPBRF6FDZR2UNQ677ROMZ/action/replication_record"}},"created_at":"2026-05-17T23:38:48.830136+00:00","updated_at":"2026-05-17T23:38:48.830136+00:00"}