BurstGPT: A real-world workload dataset to optimize LLM serving systems

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu · 2024 · arXiv 2401.17644

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background (2)

citation-polarity summary: background (2)

representative citing papers

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

cs.PF · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Hosted open-weight LLM APIs function as time-varying, heterogeneous services rather than fixed model artifacts, with concentrated demand, supply-use mismatches, and task-specific routing that yields major cost and throughput gains.

Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

cs.DC · 2026-05-19 · unverdicted · novelty 6.0

GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

cs.DC · 2025-12-10 · unverdicted · novelty 6.0

WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

A Techno-Economic Framework for Cost Modeling and Revenue Opportunities in Open and Programmable AI-RAN

cs.NI · 2026-03-30 · unverdicted · novelty 5.0 · 2 refs

A techno-economic framework shows that GPU AI-RAN deployments can offset their extra costs through AI revenue, reaching up to 8x ROI across scenarios with varying token depreciation, demand, and GPU densities.

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

cs.DC · 2025-05-15 · unverdicted · novelty 5.0

ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.

ADAPT: A Self-Calibrating Proactive Autoscaler for Container Orchestration

cs.DC · 2026-05-15 · unverdicted · novelty 4.0

ADAPT uses an EWMA estimator for cold-start durations to set a dynamic horizon in an MPC-based proactive autoscaler, achieving under 5% SLA violations with MPC+LSTM across tested workloads versus higher rates for HPA and MPC+Prophet.

citing papers explorer

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
cs.PF · 2026-05-04 · unverdicted · none · ref 20 · 2 links
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
cs.LG · 2026-01-29 · unverdicted · none · ref 9
GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
cs.DC · 2026-05-19 · unverdicted · none · ref 36
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
cs.DC · 2026-05-07 · unverdicted · none · ref 72 · 2 links
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
cs.CL · 2026-04-09 · unverdicted · none · ref 26
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
cs.DC · 2025-12-10 · unverdicted · none · ref 23
A Techno-Economic Framework for Cost Modeling and Revenue Opportunities in Open and Programmable AI-RAN
cs.NI · 2026-03-30 · unverdicted · none · ref 34 · 2 links
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
cs.DC · 2025-05-15 · unverdicted · none · ref 45
ADAPT: A Self-Calibrating Proactive Autoscaler for Container Orchestration
cs.DC · 2026-05-15 · unverdicted · none · ref 10
