SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee · 2023 · cs.LG · arXiv 2308.16369

Canonical reference. 86% of citing Pith papers cite this work as background.

30 Pith papers citing it

abstract

Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which processes the input prompt, and a decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less compared to a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. For LLaMA-33B on A100 GPU, we achieve 1.25x higher end-to-end throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.
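
The batching scheme described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' implementation. The names `chunk_prefill`, `decode_maximal_batches`, `Batch`, and `batch_slots` are hypothetical. The sketch assumes that a prefill chunk occupies one batch slot, that the same decode requests stay active across all chunks of the prompt, and that the last chunk may be shorter when the prompt length is not a multiple of the chunk size.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Batch:
    # (start, end) token offsets of the prefill chunk within the prompt
    prefill_chunk: Tuple[int, int]
    # ids of the decode requests that piggyback on this prefill chunk
    decode_ids: List[int]


def chunk_prefill(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into consecutive chunks of at most chunk_size tokens."""
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def decode_maximal_batches(
    prompt_len: int,
    chunk_size: int,
    decode_ids: Sequence[int],
    batch_slots: int,
) -> List[Batch]:
    """Build one batch per prefill chunk and fill the other slots with decodes.

    Assumption (hypothetical model): the chunk takes one slot, so at most
    batch_slots - 1 decode requests can piggyback in each batch.
    """
    if batch_slots < 1:
        raise ValueError("batch_slots must be at least 1")
    free_slots = batch_slots - 1
    piggybacked = list(decode_ids[:free_slots])
    return [
        Batch(prefill_chunk=chunk, decode_ids=list(piggybacked))
        for chunk in chunk_prefill(prompt_len, chunk_size)
    ]


if __name__ == "__main__":
    batches = decode_maximal_batches(
        prompt_len=1024, chunk_size=256, decode_ids=[10, 11, 12, 13, 14], batch_slots=4
    )
    for b in batches:
        print(b.prefill_chunk, b.decode_ids)
    # Each decode request in a batch produces one token per batch iteration.
    print("piggybacked decode tokens:", sum(len(b.decode_ids) for b in batches))
```

In this toy run a single 1024-token prompt yields four batches, so each of the three piggybacking decode requests advances by four tokens while the prompt is being prefilled. Without chunking, the same prompt would form one batch and each of those decodes would piggyback only once.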

citation-role summary

background 6 · baseline 1

citation-polarity summary

background 6 · baseline 1

representative citing papers

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

cs.DC · 2026-05-20 · unverdicted · novelty 7.0

Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.

KVBuffer: IO-aware Serving for Linear Attention

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

KVBuffer reduces linear attention decoding latency by up to 45% and increases speculative decoding throughput 5x by buffering keys/values for flexible chunked and parallel computation.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

cs.DC · 2026-05-04 · unverdicted · novelty 7.0

Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

cs.DC · 2026-03-26 · unverdicted · novelty 7.0

GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

cs.CR · 2026-03-11 · unverdicted · novelty 7.0

PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

cs.CL · 2024-10-14 · conditional · novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

Training-Inference Consistent Segmented Execution for Long-Context LLMs

cs.CL · 2026-05-12 · conditional · novelty 6.0

A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

cs.DC · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.

MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

cs.OS · 2026-04-14 · conditional · novelty 6.0

MARS coordinates heterogeneous GPU-CPU resources for agentic LLM workloads via decoupled admission control and agent-centric KV cache management, delivering up to 5.94x lower latency and 1.87x faster task completion.

Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.

Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

cs.OS · 2026-04-09 · unverdicted · novelty 6.0

Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT impact.

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

cs.LG · 2025-05-05 · conditional · novelty 6.0

RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

cs.LG · 2025-04-15 · unverdicted · novelty 6.0

The paper develops fluid-guided online scheduling algorithms (WAIT and Nested WAIT) for LLM inference that handle endogenous KV-cache memory growth and improve stability and latency over baselines in simulations.

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

cs.CL · 2024-11-29 · unverdicted · novelty 6.0

BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

cs.DC · 2026-05-19 · unverdicted · novelty 5.0

Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

cs.AI · 2026-05-19 · unverdicted · novelty 5.0

Empirical study finds non-linear, model-size-dependent throughput degradation from offloading and high model-state reload costs from preemption in multi-LLM serving.

citing papers explorer

Showing 30 of 30 citing papers.

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation cs.DC · 2026-05-20 · unverdicted · none · ref 19 · internal anchor
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
KVBuffer: IO-aware Serving for Linear Attention cs.LG · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
KVBuffer reduces linear attention decoding latency by up to 45% and increases speculative decoding throughput 5x by buffering keys/values for flexible chunked and parallel computation.
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures cs.DC · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference cs.DC · 2026-05-04 · unverdicted · none · ref 2 · internal anchor
Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving cs.LG · 2026-05-03 · unverdicted · none · ref 2 · 2 links · internal anchor
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving cs.DC · 2026-03-26 · unverdicted · none · ref 1 · internal anchor
GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems cs.CR · 2026-03-11 · unverdicted · none · ref 3 · internal anchor
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators cs.AI · 2025-11-05 · unverdicted · none · ref 1 · internal anchor
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cs.CL · 2024-10-14 · conditional · none · ref 2 · internal anchor
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
Training-Inference Consistent Segmented Execution for Long-Context LLMs cs.CL · 2026-05-12 · conditional · none · ref 34 · internal anchor
A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference cs.DC · 2026-05-04 · unverdicted · none · ref 1 · 2 links · internal anchor
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving cs.LG · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving cs.AR · 2026-04-28 · unverdicted · none · ref 1 · internal anchor
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding cs.AR · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems cs.OS · 2026-04-14 · conditional · none · ref 3 · internal anchor
MARS coordinates heterogeneous GPU-CPU resources for agentic LLM workloads via decoupled admission control and agent-centric KV cache management, delivering up to 5.94x lower latency and 1.87x faster task completion.
Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees cs.LG · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.
Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate cs.OS · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT impact.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference cs.LG · 2025-05-05 · conditional · none · ref 4 · internal anchor
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints cs.LG · 2025-04-15 · unverdicted · none · ref 1 · internal anchor
The paper develops fluid-guided online scheduling algorithms (WAIT and Nested WAIT) for LLM inference that handle endogenous KV-cache memory growth and improve stability and latency over baselines in simulations.
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching cs.CL · 2024-11-29 · unverdicted · none · ref 1 · internal anchor
BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.
HybridFlow: A Flexible and Efficient RLHF Framework cs.LG · 2024-09-28 · unverdicted · none · ref 3 · internal anchor
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles cs.DC · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.
Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption cs.AI · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
Empirical study finds non-linear, model-size-dependent throughput degradation from offloading and high model-state reload costs from preemption in multi-LLM serving.
Beyond Scaling: Agents Are Heading to the Edge cs.LG · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.
CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection cs.CL · 2026-05-16 · unverdicted · none · ref 6 · internal anchor
CompactAttention accelerates chunked-prefill attention via Block-Union KV Selection, delivering up to 2.72x speedup at 128K context on LLaMA-3.1-8B while matching dense accuracy on RULER.
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers cs.DC · 2026-05-04 · unverdicted · none · ref 12 · internal anchor
PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.
EdgeFM: Efficient Edge Inference for Vision-Language Models cs.CV · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to-end deployment on Horizon Journey hardware.
ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators cs.AR · 2025-12-10 · unverdicted · none · ref 19 · internal anchor
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 282 · internal anchor
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.