OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

D. Balamurugan; Thomas W. Bush

arxiv: 2607.01579 · v1 · pith:UL7I3O54new · submitted 2026-07-02 · 💻 cs.DC

Pith reviewed 2026-07-03 06:28 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM serving · GPU cluster · configuration advisor · conformal prediction · out-of-distribution detection · throughput prediction · utility ranking

The pith

OmniPilot predicts aggregate LLM serving throughput on heterogeneous GPUs with 6.2% MAPE and abstains from out-of-distribution configurations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OmniPilot as a launch advisor for selecting GPU type, tensor-parallel degree, and precision when serving LLMs on shared clusters. It uses a conformally calibrated model to predict costs for eight targets and an OOD layer to abstain when requests fall outside the measured support. The advisor ranks options with an economic utility metric based on operator preferences. Evaluations on 460 benchmark runs across A100, H100, and H200 hardware and four precisions show 6.2% MAPE on aggregate throughput and 95% top-1 accuracy, with all five OOD holdout cases flagged as low-confidence. If these results hold, the advisor would allow more efficient use of cluster resources by steering users away from poor configuration choices and unpredictable failures.
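
To make the ranking step concrete, here is a minimal Python sketch of how predicted outcomes could be turned into a recommendation and then scored for top-1 accuracy and regret. The linear `utility` form, the weight names, and the fields `tokens_per_s`, `gpu_hours`, and `p_fail` are illustrative assumptions. The paper's actual utility function is not given on this page, and its reported regret of 0.003 suggests a normalized scale that this sketch does not reproduce.

```python
def utility(outcome, weights):
    # Hypothetical linear utility: reward throughput, penalize cost and failure risk.
    return (weights["throughput"] * outcome["tokens_per_s"]
            - weights["cost"] * outcome["gpu_hours"]
            - weights["failure"] * outcome["p_fail"])

def recommend(options, weights):
    # Pick the option with the highest utility under the model's predictions.
    return max(options, key=lambda o: utility(o["predicted"], weights))

def regret(options, weights):
    # Utility lost versus the option that was best under measured outcomes.
    chosen = recommend(options, weights)
    best = max(utility(o["measured"], weights) for o in options)
    return best - utility(chosen["measured"], weights)

def is_top1(options, weights):
    # True when the recommendation matches the measured-best option.
    best = max(options, key=lambda o: utility(o["measured"], weights))
    return recommend(options, weights) is best

# Toy data, invented for illustration only.
WEIGHTS = {"throughput": 1.0, "cost": 10.0, "failure": 100.0}
OPTIONS = [
    {"name": "A",
     "predicted": {"tokens_per_s": 1000, "gpu_hours": 2, "p_fail": 0.05},
     "measured": {"tokens_per_s": 900, "gpu_hours": 2, "p_fail": 0.05}},
    {"name": "B",
     "predicted": {"tokens_per_s": 950, "gpu_hours": 1, "p_fail": 0.02},
     "measured": {"tokens_per_s": 940, "gpu_hours": 1, "p_fail": 0.02}},
]
```

In the toy data the model over-predicts option A, so the advisor picks A even though B measures better. That gap is exactly what regret is meant to expose.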

Core claim

OmniPilot pairs a conformally calibrated quantile cost model spanning eight serving targets with an out-of-distribution abstention layer. It ranks configurations using an economic utility metric calibrated to an operator's revealed preferences. In evaluations, it predicts aggregate throughput with 6.2% mean absolute percentage error and log-space R² of 0.92, while achieving 95% top-1 accuracy with mean utility regret of 0.003. On OOD holdouts, it flags all unsupported cases despite higher prediction error.
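
The phrase "conformally calibrated quantile cost model" can be illustrated with split conformalized quantile regression. The sketch below assumes the base model already emits lower and upper quantile predictions and that a held-out calibration set is available. Function names and the toy numbers are hypothetical, and the paper's exact calibration procedure is not described on this page.

```python
import math

def cqr_margin(cal_lo, cal_hi, cal_y, alpha=0.1):
    # Conformity score per calibration point: how far the truth falls
    # outside [lo, hi] (negative when it lies inside the interval).
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]

def calibrated_interval(lo, hi, margin):
    # Widen (or shrink, if margin is negative) the raw quantile interval.
    return lo - margin, hi + margin

# Toy calibration set (tokens/s), invented for illustration only.
CAL_LO = [90, 100, 110, 120]
CAL_HI = [110, 120, 130, 140]
CAL_Y = [95, 125, 128, 150]
```

The coverage guarantee behind this construction assumes calibration and test points are exchangeable, which is consistent with the reported loss of coverage (0 of 5 points) on the OOD holdout.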

What carries the argument

A conformally calibrated quantile cost model paired with an OOD abstention layer, with the surviving configurations ranked by an economic utility metric.
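
One simple way to picture the abstention layer is as a membership test against the measured support envelope. The sketch below is an assumption about how such a check could look, not the paper's detector: it treats the envelope as the set of observed (GPU, tensor-parallel degree, precision) cells plus a model-size range. The field names `gpu`, `tp`, `precision`, and `params_b` are hypothetical.

```python
def build_envelope(train_rows):
    # Record which categorical cells and which model-size range were measured.
    cells = {(r["gpu"], r["tp"], r["precision"]) for r in train_rows}
    sizes = [r["params_b"] for r in train_rows]
    return {"cells": cells, "size_range": (min(sizes), max(sizes))}

def advise(request, envelope, predict):
    # Abstain when the request lies outside the envelope; otherwise predict.
    cell = (request["gpu"], request["tp"], request["precision"])
    lo, hi = envelope["size_range"]
    in_support = cell in envelope["cells"] and lo <= request["params_b"] <= hi
    if not in_support:
        return {"status": "abstain", "reason": "outside measured support"}
    return {"status": "ok", "prediction": predict(request)}

# Toy training rows, invented for illustration only.
TRAIN = [
    {"gpu": "A100", "tp": 1, "precision": "bf16", "params_b": 7},
    {"gpu": "H100", "tp": 2, "precision": "fp8", "params_b": 70},
]
ENVELOPE = build_envelope(TRAIN)
```

A request for an unmeasured cell such as H200 with TP=4 returns `abstain` instead of an extrapolated number.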

Load-bearing premise

The 460 benchmark runs form a representative sample of real cluster behavior and the conformal calibration and OOD detection layers function without post-hoc tuning affecting the reported metrics.
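
Whether the runs are representative can be probed with a per-cell replicate count. The following sketch assumes each run is a record with hypothetical `family`, `precision`, and `tp` fields, and lists the cells of the full cross-product that have fewer than a chosen number of replicates. Some of those cells may be infeasible rather than missing, so the output is a prompt for inspection, not a verdict.

```python
from collections import Counter
from itertools import product

KEYS = ("family", "precision", "tp")  # hypothetical stratification keys

def cell_counts(runs, keys=KEYS):
    # Number of benchmark runs observed in each cell.
    return Counter(tuple(r[k] for k in keys) for r in runs)

def thin_cells(runs, min_replicates=3, keys=KEYS):
    # Cells in the cross-product of observed levels with too few runs.
    counts = cell_counts(runs, keys)
    levels = [sorted({r[k] for r in runs}) for k in keys]
    return {cell: counts.get(cell, 0)
            for cell in product(*levels)
            if counts.get(cell, 0) < min_replicates}

# Toy runs, invented for illustration only.
RUNS = (
    [{"family": "llama", "precision": "bf16", "tp": 1}] * 3
    + [{"family": "llama", "precision": "fp8", "tp": 2}]
)
```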

What would settle it

Running the advisor on new model families or hardware types not in the benchmarks and observing whether prediction errors stay below 10% on in-support cases or the OOD layer correctly abstains in all high-error cases.
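
One reading of this test, sketched in Python under stated assumptions: every case the advisor answers must have absolute percentage error below the limit, and every high-error case must have been abstained on. The record fields `ape` and `abstained` are hypothetical, and the 10% limit is taken from the sentence above.

```python
def violations(records, ape_limit=0.10):
    # Cases where the advisor answered but the error reached the limit.
    return [r for r in records
            if not r["abstained"] and r["ape"] >= ape_limit]

def abstention_rate(records):
    # Share of cases the advisor declined; an advisor that always
    # abstains trivially has no violations, so track this alongside.
    return sum(1 for r in records if r["abstained"]) / len(records)

# Toy outcomes on new hardware or model families, invented for illustration.
RECORDS = [
    {"ape": 0.05, "abstained": False},
    {"ape": 0.40, "abstained": True},
    {"ape": 0.30, "abstained": False},
]
```

The claim survives when `violations` is empty at an acceptable abstention rate; in the toy data the third record counts against it.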

Figures

Figures reproduced from arXiv: 2607.01579 by D. Balamurugan, Thomas W. Bush.

**Figure 1.** OmniPilot pipeline. When a launch request arrives, the cost model analyzes it to predict performance, memory use, and energy consumption, complete with a calibrated margin of error. Based on these predictions, the decision layer recommends the optimal GPU type, tensor-parallel degree, and precision. Once the job runs, its actual performance is logged and fed back into the model through an automated retrain…

**Figure 2.** The inference launch decision space (§2.2, §5.3). GPU kind, tensor-parallel degree, precision, context, …

**Figure 3.** Fixed-task learning curve (§4.4): training rows are subsampled and out-of-sample accuracy is …

Original abstract

Serving large language models (LLMs) on a shared, heterogeneous GPU cluster requires users and operators to select the GPU type, tensor-parallel degree, and precision before committing valuable node-hours. Making these choices is challenging because effective throughput, launch-success rates, and cluster demand and utilization continuously fluctuate. Furthermore, static configuration recipes miss critical interactions: quantization effects depend heavily on the model family, key-value cache pressure creates size-by-precision trade-offs, and failure rates vary by more than twofold across different tensor-parallel degrees. Additionally, cluster resources are frequently constrained by unpredictable hardware failures. To address these challenges, we present OmniPilot, a launch advisor that predicts serving costs for feasible configurations and abstains when requests fall outside its measured support envelope. OmniPilot pairs a conformally calibrated quantile cost model (spanning eight serving targets) with an out-of-distribution (OOD) abstention layer. It ranks configurations using an economic utility metric calibrated to an operator's revealed preferences. In evaluations across 460 benchmark runs on A100, H100, and H200 hardware across four precisions, OmniPilot predicts aggregate throughput with a 6.2% mean absolute percentage error (MAPE) and a log-space R² = 0.92. The advisor achieves 95% top-1 accuracy with a mean utility regret of just 0.003. When tested on an OOD holdout of unsupported cells, prediction error climbs to 24-46% and conformal intervals cover 0 of 5 points; however, the abstention layer successfully flags all five as low-confidence. Over time, these OOD scenarios will be integrated into the training dataset to continuously expand the advisor's support envelope.
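
For readers who want the two headline accuracy metrics spelled out, here is a small Python sketch. It assumes MAPE is reported in percent and that "log-space R²" means the coefficient of determination computed on log-transformed throughput, which is one common reading. The function names and toy values are illustrative, not taken from the paper.

```python
import math

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent.
    return 100.0 * sum(abs(p - t) / abs(t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

def log_r2(y_true, y_pred):
    # R^2 computed after taking natural logs of truth and prediction.
    lt = [math.log(t) for t in y_true]
    lp = [math.log(p) for p in y_pred]
    mean = sum(lt) / len(lt)
    ss_res = sum((a - b) ** 2 for a, b in zip(lt, lp))
    ss_tot = sum((a - mean) ** 2 for a in lt)
    return 1.0 - ss_res / ss_tot

# Toy throughputs (tokens/s), invented for illustration only.
Y_TRUE = [100.0, 200.0, 400.0]
Y_PRED = [110.0, 180.0, 400.0]
```

Working in log space keeps a few very high-throughput configurations from dominating the fit statistic, which is presumably why the paper reports it alongside MAPE.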

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OmniPilot combines conformal prediction with OOD abstention for LLM serving config selection and reports strong numbers on 460 runs, but the sampling frame for those runs is not detailed enough to judge representativeness.

OmniPilot is a launch advisor that predicts throughput and costs across GPU types, tensor-parallel degrees, and precisions for LLM inference, then abstains on out-of-distribution cases using conformal quantiles and a utility ranking.

The new piece is the integrated system that applies these tools to the concrete multi-target serving problem, including interactions like KV cache pressure and TP-dependent failure rates. The reported results are concrete: 6.2% MAPE and log R² of 0.92 on aggregate throughput, 95% top-1 accuracy, and 0.003 mean regret, with the abstention layer correctly flagging the OOD holdout.

That is useful engineering work for operators who need to pick configurations without wasting node-hours. The utility metric tied to revealed preferences is a reasonable way to turn predictions into rankings.

The soft spot is the 460 benchmark runs. The abstract mentions coverage of A100/H100/H200, four precisions, and >2× failure variation by TP degree, yet gives no stratification, replicate counts per cell, or sampling frame. If stable high-throughput cases are over-represented, the conformal intervals and low regret could look better than they would on a fuller distribution of real workloads. The OOD test only checks explicitly unsupported cells, not whether the main set itself is biased.

This paper is for systems people running shared LLM clusters. A reader who needs practical advice on config selection will get value from the approach and the numbers.

It deserves peer review. The core implementation and evaluation are grounded enough to warrant referee feedback on the data collection and generalization questions.

Referee Report

2 major / 2 minor

Summary. The paper presents OmniPilot, an uncertainty-aware launch advisor for LLM serving on heterogeneous GPU clusters (A100/H100/H200). It combines a conformally calibrated quantile regression model over eight serving targets with an OOD abstention layer and ranks configurations via an economic utility metric. On 460 benchmark runs across four precisions, it reports 6.2% MAPE and log-space R²=0.92 for aggregate throughput prediction, 95% top-1 accuracy, and 0.003 mean utility regret. On the OOD holdout, prediction error rises to 24-46% and the conformal intervals cover 0 of 5 points, but all five cases trigger abstention.

Significance. If the empirical results generalize beyond the evaluated runs, the approach offers a practical, uncertainty-quantified tool for reducing wasted node-hours in dynamic clusters where static recipes fail due to quantization, KV-cache, and failure-rate interactions. The conformal + OOD design is a clear strength for safe deployment.

major comments (2)

[Evaluation / Results] Evaluation section (implicit in abstract and results): the headline metrics (6.2% MAPE, R²=0.92, 95% top-1 accuracy) rest on an unstratified sample of 460 runs. No replicate counts, cell-wise coverage across (model family, precision, TP degree), or sampling frame are supplied, so it is impossible to assess whether high-throughput stable cells are over-represented relative to rare failure or high-KV-pressure regimes; this directly affects whether the conformal quantiles and utility ranking are load-bearing or artifactual.
[Methods] Methods (model training and calibration): the abstract and description provide no derivation, hyper-parameter search, cross-validation procedure, or independence check between the utility metric and the reported accuracy numbers. Without these, post-hoc exclusions or circular fitting cannot be ruled out, undermining the claim that the 0.003 regret is a genuine out-of-sample property.

minor comments (2)

[Introduction] The claim that failure rates vary by more than twofold across TP degrees is stated without a supporting table or figure; adding the raw per-TP failure counts would strengthen the motivation.
[Approach] Notation for the eight serving targets and the exact form of the economic utility function should be defined explicitly rather than left to the reader to infer from context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on evaluation and methods. We address each major comment below and commit to revisions that supply the requested details without altering the reported results.

Referee: [Evaluation / Results] Evaluation section (implicit in abstract and results): the headline metrics (6.2% MAPE, R²=0.92, 95% top-1 accuracy) rest on an unstratified sample of 460 runs. No replicate counts, cell-wise coverage across (model family, precision, TP degree), or sampling frame are supplied, so it is impossible to assess whether high-throughput stable cells are over-represented relative to rare failure or high-KV-pressure regimes; this directly affects whether the conformal quantiles and utility ranking are load-bearing or artifactual.

Authors: We agree that the manuscript does not report replicate counts, cell-wise coverage, or an explicit sampling frame for the 460 runs. While the runs span multiple model families, four precisions, and varying tensor-parallel degrees on A100/H100/H200 hardware, the absence of a breakdown prevents readers from evaluating potential over-representation of stable high-throughput cells. We will revise the Evaluation section to include a table documenting the number of runs per (model family, precision, TP degree) cell along with a description of the benchmark collection procedure. This addition will allow assessment of whether the conformal quantiles and 0.003 utility regret rest on balanced coverage.

revision: yes

Referee: [Methods] Methods (model training and calibration): the abstract and description provide no derivation, hyper-parameter search, cross-validation procedure, or independence check between the utility metric and the reported accuracy numbers. Without these, post-hoc exclusions or circular fitting cannot be ruled out, undermining the claim that the 0.003 regret is a genuine out-of-sample property.

Authors: The manuscript provides no derivation of the utility metric, no hyper-parameter search details, no cross-validation procedure, and no explicit check that the utility metric is independent of the accuracy and regret calculations. We acknowledge this omission leaves open the possibility of circularity or post-hoc adjustments. We will add a dedicated Methods subsection that (i) derives the economic utility function from operator preferences, (ii) describes the grid search and validation split used for hyper-parameter selection in the quantile regression, (iii) states the cross-validation scheme for conformal calibration, and (iv) confirms that all accuracy and regret figures were computed on a held-out test set disjoint from training and calibration data. Any exclusions will be documented.

revision: yes
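
The disjointness promised in point (iv) is easy to enforce and check mechanically. The sketch below assumes each run has a stable string identifier and assigns it to a train, calibration, or test split by hashing that identifier. The 60/20/20 fractions and the `run-N` naming are hypothetical; the paper's actual split is not described on this page.

```python
import hashlib

def split_of(run_id, fractions=(0.6, 0.2, 0.2)):
    # Deterministic assignment: hash the ID into [0, 1) and bucket it.
    digest = hashlib.sha256(run_id.encode("utf-8")).hexdigest()
    u = (int(digest, 16) % 10_000) / 10_000
    if u < fractions[0]:
        return "train"
    if u < fractions[0] + fractions[1]:
        return "calibration"
    return "test"

def partition(run_ids):
    # Group run IDs by split; each ID lands in exactly one set.
    parts = {"train": set(), "calibration": set(), "test": set()}
    for rid in run_ids:
        parts[split_of(rid)].add(rid)
    return parts

# Hypothetical identifiers for the 460 benchmark runs.
RUN_IDS = [f"run-{i}" for i in range(460)]
PARTS = partition(RUN_IDS)
```

Hashing per run keeps the split reproducible, but replicates of the same configuration can still land in different splits. Hashing a cell key instead of a run ID would be the stricter choice if leakage across replicates is a concern.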

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on benchmark data

The paper describes an empirical advisor built from 460 benchmark runs across hardware and precisions. It reports standard performance metrics (MAPE, R², top-1 accuracy, regret) on those runs plus an OOD holdout. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations. The conformal quantile model and utility ranking are trained and evaluated on the collected data in the usual supervised manner; the reported numbers are direct empirical outcomes rather than tautological restatements of inputs. Self-citation is absent from the provided text. This is the normal non-circular case for an applied ML systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; the quantile model and conformal layer are presumed to contain fitted parameters from the 460 runs but none are named or quantified.

