super hub Mixed citations

Efficient Memory Management for Large Language Model Serving with PagedAttention

Cody Hao Yu, Lianmin Zheng, Siyuan Zhuang, Woosuk Kwon, Ying Sheng, Zhuohan Li · 2023 · cs.LG · arXiv 2309.06180

Mixed citation behavior. Most common role is background (62%).

102 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 102 citing papers more from Cody Hao Yu arXiv PDF

abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4$\times$ with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 method 4 dataset 2

citation-polarity summary

background 10 use method 4 use dataset 2

claims ledger

abstract High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system

authors

Cody Hao Yu Lianmin Zheng Siyuan Zhuang Woosuk Kwon Ying Sheng Zhuohan Li

co-cited works

representative citing papers

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

cs.AR · 2026-06-25 · unverdicted · novelty 7.0

CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.

Alignment Defends LLMs from Property Inference Attacks

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

MRMMIA: Membership Inference Attacks on Memory in Chat Agents

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

MRMMIA is a multi-recall-probe membership inference attack that extracts signals from chat agent memory and outperforms baselines in black-, gray-, and white-box settings.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

CutVerse benchmark evaluates GUI agents on 186 complex media post-production tasks in seven apps and reports 36% success rate for existing models.

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Fine-tuned LLMs trained with reinforcement learning using verifiable rewards produce floor plans that satisfy connectivity and numerical constraints, outperforming prior methods with at least 94% relative improvement in compatibility.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

cs.DC · 2026-04-28 · unverdicted · novelty 7.0

CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

cs.LG · 2026-04-23 · unverdicted · novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

Neural Garbage Collection: Learning to Forget while Learning to Reason

cs.LG · 2026-04-20 · conditional · novelty 7.0

Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

citing papers explorer

Showing 9 of 9 citing papers after filters.

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing cs.CV · 2026-05-19 · unverdicted · none · ref 22 · internal anchor
CutVerse benchmark evaluates GUI agents on 186 complex media post-production tasks in seven apps and reports 36% success rate for existing models.
Harnessing Streaming Video in the Wild cs.CV · 2026-06-07 · unverdicted · none · ref 25 · internal anchor
Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics cs.CV · 2026-06-02 · unverdicted · none · ref 12 · internal anchor
STS is a two-stage pruning framework that decouples structural diversity via repulsion sampling from semantic filtering via cross-attention to reduce redundancy in visual tokens for VLMs.
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers cs.CV · 2026-05-26 · unverdicted · none · ref 20 · internal anchor
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch cs.CV · 2026-01-20 · conditional · none · ref 17 · internal anchor
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation cs.CV · 2026-05-28 · unverdicted · none · ref 24 · internal anchor
SAFE-Pruner forecasts deep-layer token saliency in VLA models via semantic attention consistency and adaptive subtask detection to achieve up to 1.89x speedup with under 1.7% success rate loss.
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models cs.CV · 2026-05-05 · unverdicted · none · ref 42 · internal anchor
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
EdgeFM: Efficient Edge Inference for Vision-Language Models cs.CV · 2026-04-30 · unverdicted · none · ref 4 · 2 links · internal anchor
EdgeFM is an agent-driven VLM inference framework achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin and first end-to-end deployment on Horizon Journey platform.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 18 · internal anchor
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Efficient Memory Management for Large Language Model Serving with PagedAttention

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer