Title resolution pending

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica

29 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary: background 4

citation-polarity summary: background 4

representative citing papers

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

cs.DC · 2026-05-20 · unverdicted · novelty 7.0

Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

cs.DC · 2026-05-13 · conditional · novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

cs.OS · 2026-05-05 · unverdicted · novelty 7.0

Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

cs.DC · 2026-04-22 · unverdicted · novelty 7.0

FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

cs.PF · 2026-04-20 · unverdicted · novelty 7.0

HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

AgenTEE: Confidential LLM Agent Execution on Edge Devices

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

cs.DC · 2026-04-14 · unverdicted · novelty 7.0

PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

cs.DC · 2026-03-05 · unverdicted · novelty 7.0

Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

cs.DC · 2026-05-16 · unverdicted · novelty 6.0

ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.

Designing Datacenter Power Delivery Hierarchies for the AI Era

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.

Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact

cs.CY · 2026-05-13 · unverdicted · novelty 6.0

Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

cs.DC · 2026-04-07 · unverdicted · novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

cs.DC · 2026-04-03 · unverdicted · novelty 6.0

TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

cs.DC · 2025-10-22 · unverdicted · novelty 6.0

RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.

eLLM: Elastic Memory Management Framework for Efficient LLM Serving

cs.DC · 2025-06-18 · unverdicted · novelty 6.0

eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.

Can Large Language Models Really Recognize Your Name?

cs.CR · 2025-05-20 · unverdicted · novelty 6.0

LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

SSV: Sparse Speculative Verification for Efficient LLM Inference

cs.OS · 2026-05-19 · unverdicted · novelty 5.0

SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVIDIA H100 GPUs.

citing papers explorer

Showing 29 of 29 citing papers.

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
cs.DC · 2026-05-20 · unverdicted · none · ref 30
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics
cs.OS · 2026-05-18 · unverdicted · none · ref 39
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support
cs.SE · 2026-05-14 · unverdicted · none · ref 20
Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
cs.DC · 2026-05-13 · conditional · none · ref 21
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
cs.OS · 2026-05-05 · unverdicted · none · ref 20
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
cs.DC · 2026-04-22 · unverdicted · none · ref 19
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
cs.PF · 2026-04-20 · unverdicted · none · ref 22
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

AgenTEE: Confidential LLM Agent Execution on Edge Devices
cs.CR · 2026-04-20 · unverdicted · none · ref 29
AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
cs.DC · 2026-04-14 · unverdicted · none · ref 13
PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
cs.DC · 2026-03-05 · unverdicted · none · ref 7
Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
cs.DC · 2026-05-20 · unverdicted · none · ref 31
NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse
cs.DC · 2026-05-16 · unverdicted · none · ref 36
ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.

Designing Datacenter Power Delivery Hierarchies for the AI Era
cs.DC · 2026-05-15 · unverdicted · none · ref 31
Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.

Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact
cs.CY · 2026-05-13 · unverdicted · none · ref 28
Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
cs.LG · 2026-05-07 · unverdicted · none · ref 18
Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG · 2026-04-29 · unverdicted · none · ref 28
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
cs.AI · 2026-04-19 · unverdicted · none · ref 19
Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC · 2026-04-07 · unverdicted · none · ref 32
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
cs.DC · 2026-04-03 · unverdicted · none · ref 15
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
cs.DC · 2025-10-22 · unverdicted · none · ref 21
RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.

eLLM: Elastic Memory Management Framework for Efficient LLM Serving
cs.DC · 2025-06-18 · unverdicted · none · ref 16
eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.

Can Large Language Models Really Recognize Your Name?
cs.CR · 2025-05-20 · unverdicted · none · ref 37
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.

HybridFlow: A Flexible and Efficient RLHF Framework
cs.LG · 2024-09-28 · unverdicted · none · ref 45
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

SSV: Sparse Speculative Verification for Efficient LLM Inference
cs.OS · 2026-05-19 · unverdicted · none · ref 18
SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVIDIA H100 GPUs.

Strait: Perceiving Priority and Interference in ML Inference Serving
cs.LG · 2026-04-30 · unverdicted · none · ref 60
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
cs.DC · 2026-02-21 · unverdicted · none · ref 18
DualScale reduces energy by up to 39% in prefill and 48% in decode for disaggregated LLM serving while meeting TTFT and TPOT SLOs on a 16x H100 cluster.

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
cs.DC · 2025-08-21 · unverdicted · none · ref 39
TurboMind delivers up to 61% lower latency and 156% higher throughput for mixed-precision LLM inference across 16 models and 4 GPU architectures via optimized weight packing, adaptive alignment, instruction parallelism, and KV memory pipelines.

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
cs.DC · 2025-05-15 · unverdicted · none · ref 28
ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.

LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers
cs.LG · 2025-09-28 · unverdicted · none · ref 25
PreScope combines a layer-aware activation predictor, cross-layer prefetch scheduling, and asynchronous I/O to deliver 141% higher throughput and 74.6% lower latency for MoE inference on legacy hardware.
