Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
Title resolution pending
29 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.
PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.
Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.
ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.
Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.
Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.
Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.
RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.
eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVIDIA H100 GPUs.
citing papers explorer
-
Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
-
TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
-
Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support
Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.
-
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
-
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.
-
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
-
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
-
AgenTEE: Confidential LLM Agent Execution on Edge Devices
AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.
-
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.
-
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
-
NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.
-
ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse
ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.
-
Designing Datacenter Power Delivery Hierarchies for the AI Era
Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.
-
Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact
Google AI Overviews activate on 13.7% of queries overall and 64.7% of questions, cite more credible sources than standard results but omit key information in 11% of claims, and suppress clicks on over half of cited pages that carry ads.
-
Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.
-
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
-
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.
-
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.
-
eLLM: Elastic Memory Management Framework for Efficient LLM Serving
eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.
-
Can Large Language Models Really Recognize Your Name?
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.
-
HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
-
SSV: Sparse Speculative Verification for Efficient LLM Inference
SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVIDIA H100 GPUs.
-
Strait: Perceiving Priority and Interference in ML Inference Serving
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.
-
DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
DualScale reduces energy by up to 39% in prefill and 48% in decode for disaggregated LLM serving while meeting TTFT and TPOT SLOs on a 16x H100 cluster.
-
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
TurboMind delivers up to 61% lower latency and 156% higher throughput for mixed-precision LLM inference across 16 models and 4 GPU architectures via optimized weight packing, adaptive alignment, instruction parallelism, and KV memory pipelines.
-
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.
-
LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers
PreScope combines a layer-aware activation predictor, cross-layer prefetch scheduling, and asynchronous I/O to deliver 141% higher throughput and 74.6% lower latency for MoE inference on legacy hardware.