BurstGPT: A real-world workload dataset to optimize LLM serving systems
Canonical reference. 80% of citing Pith papers cite this work as background.
Years: 2026 (7)
Verdicts: UNVERDICTED (7)
citing papers explorer
- SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
  SAGA reduces AI agent task completion time by 1.64x on 64-GPU clusters by scheduling at the full workflow level with execution graphs, affinity batching, and completion-time fairness.
- InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
  InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
- LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
  LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
- The Energy Cost of Execution-Idle in GPU Clusters
  Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.
- Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
  High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
- A Techno-Economic Framework for Cost Modeling and Revenue Opportunities in Open and Programmable AI-RAN
  Techno-economic framework shows that GPU AI-RAN deployments can offset extra costs via AI revenue for up to 8x ROI across scenarios with varying token depreciation, demand, and GPU densities.
- Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research
  GPT-4o exhibits daily and weekly periodic fluctuations in performance on a fixed physics task, accounting for about 20% of observed variance.