Efficient Memory Management for Large Language Model Serving with PagedAttention
35 Pith papers cite this work. Polarity classification is still indexing.
abstract
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4$\times$ with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
citing papers explorer
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to retrieval noise.
-
Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards
Fine-tuned LLMs trained with reinforcement learning using verifiable rewards produce floor plans that satisfy connectivity and numerical constraints, outperforming prior methods with at least 94% relative improvement in compatibility.
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Power capping is illusory in LLM decode because memory-bound operation leaves the power headroom of 700 W GPUs untouched, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
-
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
-
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.
-
PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
-
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.
-
Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels
Integrated solar-compute-radiator panels enable orbital satellites to achieve over 100 kW of AI inference compute per metric ton launched, supporting thousands of simultaneous large language model sessions.
-
MemFactory: Unified Inference & Training Framework for Agent Memory
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
An Executable Benchmarking Suite for Tool-Using Agents
The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
-
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
-
VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
EdgeFM: Efficient Edge Inference for Vision-Language Models
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to-end deployment on Horizon Journey hardware.
-
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
-
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
-
Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.
-
Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation
An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
Fine-tuned small language models outperform larger models in natural language to domain-specific code generation, with better task performance, lower latency, and the ability to adapt to customer-specific scenarios without losing general capabilities.