Cost-aware Duration Prediction for Software Upgrades in Datacenters

Aijia Gao; Essam Ewaisha; Henry Hoffmann; Igor Marnat; Michal Sedlak; Thibaud Ryden; Yi Ding

arxiv: 2212.05155 · v2 · pith:LFA4ZSEHnew · submitted 2022-12-10 · 💻 cs.DC · cs.LG

Yi Ding, Aijia Gao, Thibaud Ryden, Michal Sedlak, Essam Ewaisha, Igor Marnat, Henry Hoffmann

Pith reviewed 2026-05-24 09:56 UTC · model grok-4.3

keywords software upgrade · duration prediction · datacenter maintenance · cost-aware modeling · scheduling optimization · straggler mitigation · service level objectives

The pith

Acela predicts software upgrade durations in datacenters while accounting for asymmetric misprediction costs to raise scheduling efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Software upgrades in large datacenters require accurate predictions of how long each upgrade will take so that they can be scheduled into limited maintenance windows without disrupting services. The paper frames this as an optimization problem with constraints from service-level objectives and introduces Acela to predict durations while weighing the different costs of overestimating versus underestimating the time. Acela also chooses among models and corrects for cases where a few slow servers skew the estimates upward. When deployed on production systems, this approach allows significantly more upgrades to be scheduled and completed within the same windows. A sympathetic reader would care because more efficient upgrades mean servers stay updated with fewer disruptions and higher overall reliability.

Core claim

The central claim is that a duration prediction system which explicitly models asymmetric costs of prediction errors, selects appropriate models, and mitigates straggler-induced overestimations can raise upgrade window utilization by a factor of 1.25, increase the number of scheduled upgrades by 33 percent and completed upgrades by 41 percent, and cut cancellation rates by a factor of 2.4 when evaluated on Meta's production datacenter systems.

What carries the argument

Acela, the cost-aware duration prediction framework that strategically selects predictive models based on misprediction costs and adjusts for stragglers.
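One standard way to encode asymmetric misprediction costs is the quantile (pinball) loss used in quantile regression, which the paper's figures refer to. The sketch below is illustrative only: the function name, the 0.95 quantile, and the example durations are assumptions, not values or code from the paper.

```python
def pinball_loss(actual, predicted, q):
    """Quantile (pinball) loss for a single job, with q in (0, 1)."""
    error = actual - predicted
    # Underprediction (error >= 0) costs q per unit of error;
    # overprediction (error < 0) costs (1 - q) per unit of error.
    return q * error if error >= 0 else (q - 1) * error


# Hypothetical job that actually takes 100 minutes.
under = pinball_loss(100.0, 90.0, q=0.95)   # predicted 10 minutes too short
over = pinball_loss(100.0, 110.0, q=0.95)   # predicted 10 minutes too long
print(under > over)  # the same absolute error is penalized more when underpredicting
```

With a high quantile, a model trained on this loss is pushed toward predictions that rarely fall short of the actual duration, which matches the overprediction-oriented SLO described in the figure captions.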

If this is right

More upgrades can be completed without expanding maintenance windows.
Upgrade cancellations due to time overruns decrease substantially.
Service level objectives are met more reliably during upgrade periods.
The overall throughput of the upgrade scheduler increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cost-aware selection logic might improve duration predictions for other datacenter tasks such as data migrations or hardware servicing.
If the framework generalizes, operators could apply it to new upgrade types with minimal additional tuning.
Datacenters with more heterogeneous hardware might require extensions to handle a wider range of straggler behaviors.

Load-bearing premise

That the upgrade characteristics and workloads observed in Meta's datacenters are representative enough that models trained there will achieve similar gains in other environments.

What would settle it

Deploying Acela on upgrade logs from a second large-scale datacenter operator and measuring whether the reported improvements in utilization, throughput, and cancellation rates still appear.

Figures

Figures reproduced from arXiv: 2212.05155 by Aijia Gao, Essam Ewaisha, Henry Hoffmann, Igor Marnat, Michal Sedlak, Thibaud Ryden, Yi Ding.

**Figure 2.** An example of differences between quantile regression (QR) with different quantiles and standard regression (SR).

**Figure 3.** CDFs of normalized duration. Median and p99 duration are marked by green circles and red stars.

**Figure 4.** The quantile losses with di…

**Figure 5.** The overall workflow of Acela. Acela works in an online fashion: it continually collects data as new maintenance jobs are executed and retrains models as new data come in. The input to Acela is the user SLO for model hyperparameter tuning; in our experiments, the SLO is that at least 95% of jobs on the validation set are overpredicted, with the highest prediction accuracy. Acela includes three components: …

**Figure 6.** Prediction accuracy in MAPE for each method per firmware.

**Figure 7.** Overprediction rate (OPR) for each method per firmware.

**Figure 8.** Sensitivity analyses of MAPE at different quantiles for Acela. (Plot text recovered from extraction: panels CPLD, FLASH, BIC, BIOS, NIC, OPENBMC; axes q (%) and OPR (%).)

**Figure 9.** Sensitivity analyses of overprediction rate (OPR) at di…
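The captions above name two metrics, MAPE and overprediction rate (OPR), and an SLO of at least 95% overpredicted validation jobs with the highest accuracy. A minimal sketch of how such a selection rule could be written follows. The function names, the candidate labels, the toy durations, and the choice to count a prediction equal to the actual duration as overpredicted are all assumptions made for illustration, not the paper's implementation.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def overprediction_rate(actual, predicted):
    """Percent of jobs whose prediction is at least the actual duration (assumed definition)."""
    return 100.0 * sum(p >= a for a, p in zip(actual, predicted)) / len(actual)


def select_candidate(actual, candidates, min_opr=95.0):
    """Return the lowest-MAPE candidate whose OPR meets min_opr, or None if none does."""
    feasible = [
        (mape(actual, preds), name)
        for name, preds in candidates.items()
        if overprediction_rate(actual, preds) >= min_opr
    ]
    return min(feasible)[1] if feasible else None


# Hypothetical validation durations (minutes) and predictions from three candidate models.
actual = [10.0, 20.0, 30.0, 40.0]
candidates = {
    "q50": [9.0, 21.0, 29.0, 41.0],   # accurate, but underpredicts half the jobs
    "q90": [11.0, 22.0, 33.0, 44.0],  # overpredicts every job by 10%
    "q99": [15.0, 30.0, 45.0, 60.0],  # overpredicts every job by 50%
}
print(select_candidate(actual, candidates))
```

In this toy data the most accurate candidate overall fails the overprediction constraint, so the rule picks the tightest candidate that still satisfies it.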

Original abstract

Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Acela, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Acela accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Acela significantly increases efficiency of the existing upgrade scheduler by improving upgrade window utilization by 1.25X, increasing the number of scheduled and completed upgrades by 33% and 41%, and reducing cancellation rates by 2.4X. The code and data sets will be released after paper acceptance.
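To make the link between prediction error and window utilization concrete, here is a toy model of a single maintenance window. It is a sketch under stated assumptions, not the paper's scheduler or its optimization formulation: the function name, the first-fit admission rule, the back-to-back execution, and all durations are hypothetical.

```python
def simulate_window(jobs, window):
    """Simulate one window. jobs is a list of (predicted, actual) minutes in scheduling order."""
    admitted, planned = [], 0.0
    for predicted, actual in jobs:
        # First-fit admission: take a job only if its predicted duration still fits.
        if planned + predicted <= window:
            admitted.append(actual)
            planned += predicted
    elapsed, completed = 0.0, 0
    for actual in admitted:
        if elapsed + actual > window:
            break  # this job and every later admitted job is cancelled
        elapsed += actual
        completed += 1
    return {
        "scheduled": len(admitted),
        "completed": completed,
        "cancelled": len(admitted) - completed,
        "utilization": elapsed / window,
    }


# Four hypothetical jobs that each really take 30 minutes, in a 100-minute window.
under = simulate_window([(20.0, 30.0)] * 4, window=100.0)  # underpredicted durations
loose = simulate_window([(50.0, 30.0)] * 4, window=100.0)  # heavily overpredicted
tight = simulate_window([(33.0, 30.0)] * 4, window=100.0)  # slightly overpredicted
for name, result in (("under", under), ("loose", loose), ("tight", tight)):
    print(name, result)
```

In this toy setting, underprediction admits a job that then has to be cancelled, heavy overprediction leaves part of the window idle, and slight overprediction completes the most work without cancellations.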

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Acela improves Meta's upgrade scheduler enough to finish 41 percent more upgrades, but the gains rest on one site's workload patterns.

The headline result is that Acela improves Meta's upgrade scheduler enough to finish 41 percent more upgrades while cutting cancellations by more than half. The gains come from a predictor that explicitly trades off different kinds of errors and from a straggler fix.

What stands out as new is the cost-aware model selection inside the framework. Instead of picking the model with lowest average error, Acela chooses based on the asymmetric costs of over- and under-prediction for upgrade windows. They also add a mitigation step for stragglers that would otherwise push estimates too high. The initial characterization of upgrade types and the constrained optimization formulation give a clean way to think about the problem.

The production evaluation is solid on its own terms. The numbers are measured on live Meta systems rather than simulations, and the improvements in window utilization and throughput are large enough to matter for operations. Planning to release the code and data sets is the right move and makes the work more credible.

The soft spot is scope. Everything is tuned and tested on Meta's upgrade patterns. If another site has different mixes of upgrade types, tighter or looser SLOs, or different straggler behavior, the same model selection logic may not deliver comparable gains. The paper does not show ablations that vary those parameters or tests on additional traces, so the general applicability remains an open question.

This paper is aimed at datacenter operators and scheduling researchers who deal with maintenance tasks that have hard time windows. A reader who needs a practical method for raising upgrade efficiency will get concrete ideas and numbers. It deserves a serious referee because the empirical claims are grounded in production data and the approach is reproducible once the code is out, even though the generalization argument needs more support. I would recommend sending it for peer review with a note to the authors to add validation on varied workloads.

Referee Report

1 major / 0 minor

Summary. The manuscript characterizes software upgrade types in datacenters and frames upgrade scheduling as a constrained optimization problem. It introduces Acela, a cost-aware duration prediction framework that strategically selects models while accounting for asymmetric misprediction costs and mitigating straggler-induced overestimations to meet SLOs. Production evaluations at Meta report that Acela improves existing scheduler efficiency by 1.25X in upgrade window utilization, increases scheduled and completed upgrades by 33% and 41%, and reduces cancellation rates by 2.4X. Code and datasets are promised for release.

Significance. If the empirical gains hold under broader conditions, the work addresses an under-explored scheduling domain with direct operational relevance for large-scale datacenters. The explicit handling of cost asymmetry and stragglers, combined with the commitment to release artifacts, would strengthen reproducibility and enable follow-on studies.

major comments (1)

[Evaluation] Evaluation section: The reported gains (1.25X window utilization, +33%/41% scheduled/completed upgrades, 2.4X lower cancellations) rest exclusively on a single Meta production trace. No cross-site experiments, synthetic workloads, or sensitivity analysis varying upgrade-duration distributions, straggler frequency, or misprediction-cost ratios are presented, so it is unclear whether the constrained optimizer would deliver comparable throughput under different workload traits.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the feedback. The evaluation concern is valid and we address it directly below.

Referee: Evaluation section: The reported gains (1.25X window utilization, +33%/41% scheduled/completed upgrades, 2.4X lower cancellations) rest exclusively on a single Meta production trace. No cross-site experiments, synthetic workloads, or sensitivity analysis varying upgrade-duration distributions, straggler frequency, or misprediction-cost ratios are presented, so it is unclear whether the constrained optimizer would deliver comparable throughput under different workload traits.

Authors: We agree the evaluation uses a single production trace from Meta. This trace captures the actual upgrade types, straggler patterns, and cost asymmetries observed in a large-scale operational setting, which is the intended deployment environment. Cross-site experiments are not possible because comparable traces from other operators are unavailable due to proprietary constraints. We will, however, add sensitivity analysis in the revision: using the promised dataset release, we will perturb upgrade-duration distributions, straggler rates, and misprediction-cost ratios and re-run the constrained optimizer to quantify throughput variation. This addresses the robustness question while remaining within the scope of the released artifacts.

Revision: partial
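A sensitivity check of the kind described in this response could start from something as small as the sketch below, which varies the straggler share in synthetic data and watches a high-quantile duration estimate. Everything here is an assumption for illustration: the nearest-rank quantile, the two-valued duration distribution, and the chosen percentages are hypothetical and do not come from the paper or its dataset.

```python
import math


def nearest_rank_quantile(values, q_percent):
    """Nearest-rank empirical quantile for an integer percentage in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q_percent * len(ordered) / 100))
    return ordered[rank - 1]


def synthetic_durations(n, straggler_percent, base=10.0, slow=60.0):
    """Build n durations: a fixed share of stragglers at `slow`, the rest at `base`."""
    n_slow = n * straggler_percent // 100
    return [base] * (n - n_slow) + [slow] * n_slow


# Sweep the straggler share and record the 95th-percentile duration.
p95_by_rate = {
    rate: nearest_rank_quantile(synthetic_durations(100, rate), 95)
    for rate in (2, 5, 10)
}
print(p95_by_rate)
```

In this toy distribution the 95th-percentile estimate stays at the base duration until stragglers exceed 5% of jobs, then jumps to the straggler duration, which is the kind of straggler-driven inflation a fuller sensitivity study would need to quantify.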

standing simulated objections not resolved

Cross-site experiments cannot be conducted because production traces from other datacenter operators are not accessible.

Circularity Check

0 steps flagged

No circularity; empirical evaluation on production traces

The paper frames upgrade scheduling as a constrained optimization problem and introduces Acela as a cost-aware prediction framework, but all load-bearing claims are empirical measurements (1.25X utilization, +33/41% upgrades, 2.4X lower cancellations) obtained by running the system on Meta production traces. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the reported gains are direct outcomes of deployment rather than reductions of a derivation to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are described in the provided text.


[1] [1]

Evolve or die: High- availability design principles drawn from googles net- work infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat. Evolve or die: High- availability design principles drawn from googles net- work infrastructure. In Proceedings of the 2016 ACM SIGCOMM Conference, SIGCOMM ’16, page 58–72, New York, NY , USA, 2016. Association for Comput- ing Machinery. ISBN 9781450341936. doi: 10. 1145/2934872...

work page arXiv 2016

[2] [2]

The datacenter as a computer: An introduction to the design of warehouse-scale machines

Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 8(3):1–154, 2013

work page 2013

[3] [3]

zupdate: Updating data center networks with zero loss

Hongqiang Harry Liu, Xin Wu, Ming Zhang, Lihua Yuan, Roger Wattenhofer, and David Maltz. zupdate: Updating data center networks with zero loss. In Pro- ceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, pages 411–422, 2013

work page 2013

[4] [4]

Ex- plicit path control in commodity data centers: Design and applications

Shuihai Hu, Kai Chen, Haitao Wu, Wei Bai, Chang Lan, Hao Wang, Hongze Zhao, and Chuanxiong Guo. Ex- plicit path control in commodity data centers: Design and applications. In 12th USENIX Symposium on Net- worked Systems Design and Implementation (NSDI 15), pages 15–28, 2015

work page 2015

[5] [5]

Gwin, Sathish K Sekar, Sergey N

Parthasarathy Ranganathan, Danner Stodolsky, Je ﬀ Calow, Jeremy Dorfman, Marisabel Guevara Hecht- man, Clint Smullen, Aki Kuusela, Aaron James Laursen, Alex Ramirez, Alvin Adrian Wijaya, Amir Salek, Anna Cheung, Ben Gelb, Brian Fosco, Cho Mon Kyaw, Dake He, David Alexander Munday, David Wickeraad, Devin Persaud, Don Stark, Drew Walton, Elisha Indu- palli,...

work page 2021

[6] [6]

Meet: rack- level pooling based load balancing in datacenter net- works

Jiaqing Dong, Lijuan Tan, Chen Tian, Yuhang Zhou, Yi Wang, Wanchun Dou, and Guihai Chen. Meet: rack- level pooling based load balancing in datacenter net- works. IEEE Transactions on Parallel and Distributed Systems, 2022

work page 2022

[7] [7]

Estimating computation times of data-intensive applications

Shonali Krishnaswamy, Seng Wai Loke, and Arkady Za- slavsky. Estimating computation times of data-intensive applications. IEEE Distributed Systems Online , 5(4), 2004

work page 2004

[8] [8]

Stratus: Cost-aware container scheduling in the public cloud

Andrew Chung, Jun Woo Park, and Gregory R Ganger. Stratus: Cost-aware container scheduling in the public cloud. In Proceedings of the ACM symposium on cloud computing, pages 121–134, 2018

work page 2018

[9] Carlo Curino, Djellel E Difallah, Chris Douglas, Subru Krishnan, Raghu Ramakrishnan, and Sriram Rao. Reservation-based scheduling: If you’re late don’t blame us! In Proceedings of the ACM Symposium on Cloud Computing, pages 1–14, 2014.

[10] Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, and Matthew Caesar. Network-aware scheduling for data-parallel jobs: Plan when you can. ACM SIGCOMM Computer Communication Review, 45(4):407–420, 2015.

[11] Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Inigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. Morpheus: Towards automated slos for enterprise clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 117... 2016

[12] Kaushik Rajan, Dharmesh Kakadia, Carlo Curino, and Subru Krishnan. Perforator: eloquent performance models for resource optimization. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 415–427, 2016.

[13] Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael A Kozuch, Mor Harchol-Balter, and Gregory R Ganger. Tetrisched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceedings of the Eleventh European Conference on Computer Systems, pages 1–16, 2016.

[14] Călin Iorgulescu, Florin Dinu, Aunn Raza, Wajih Ul Hassan, and Willy Zwaenepoel. Don’t cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 97–109, 2017.

[15] Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A Kozuch, and Gregory R Ganger. 3sigma: distribution-based cluster scheduling for runtime uncertainty. In Proceedings of the Thirteenth EuroSys Conference, pages 1–17, 2018.

[16] Akshay Jajoo, Y Charlie Hu, Xiaojun Lin, and Nan Deng. A case for task sampling based learning for cluster job scheduling. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 19–33, 2022.

[17] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic perspectives, 15(4):143–156, 2001.

[18] Harvey M Salkin and Cornelis A De Kluyver. The knapsack problem: a survey. Naval Research Logistics Quarterly, 22(1):127–144, 1975.

[19] Bruce L Bowerman, Richard T O’Connell, and Anne B Koehler. Forecasting, time series, and regression: an applied approach, volume 4. South-Western Pub, 2005.

[20] Cpld. https://github.com/mikeroyal/cpld-guide

[21] Flash. https://www.intel.com/content/www/us/en/download/17903/intel-ssd-firmware-update-tool.html

[22] Bic. https://github.com/facebook/openbic

[23] Bios. https://github.com/openbios/openbios

[24] Nic. https://github.com/netronome/nic-firmware

[25] Openbmc. https://github.com/openbmc

[26] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017.

[27] Qingwei Lin, Ken Hsieh, Yingnong Dang, Hongyu Zhang, Kaixin Sui, Yong Xu, Jian-Guang Lou, Chenggang Li, Youjiang Wu, Randolph Yao, et al. Predicting node failure in cloud service systems. In Proceedings of the 2018 26th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pages 480–490, 2018.

[28] Yi Ding, Avinash Rao, Hyebin Song, Rebecca Willett, and Henry Hank Hoffmann. Nurd: Negative-unlabeled learning for online datacenter straggler prediction. Proceedings of Machine Learning and Systems, 4:190–203, 2022.

[29] Benjamin C Lee, Jamison Collins, Hong Wang, and David Brooks. Cpr: Composable performance regression for scalable multiprocessor models. In 2008 41st IEEE/ACM International Symposium on Microarchitecture, pages 270–281. IEEE, 2008.

[30] Yi Ding, Ahsan Pervaiz, Michael Carbin, and Henry Hoffmann. Generalizable and interpretable learning for configuration extrapolation. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pages 728–740, 2021.

[31] Shuang Chen, Angela Jin, Christina Delimitrou, and José F Martínez. Retail: Opting for learning simplicity to enable qos-aware power management in the cloud. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 155–168. IEEE, 2022.

[32] Neeraja J Yadwadkar, Ganesh Ananthanarayanan, and Randy Katz. Wrangler: Predictable and faster jobs using fewer resources. In Proceedings of the ACM Symposium on Cloud Computing, pages 1–14, 2014.

[33] Luigi Nardi, Artur Souza, David Koeplinger, and Kunle Olukotun. Hypermapper: A practical design space exploration framework. In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 425–426. IEEE, 2019.

[34] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. Tvm: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.

[35] Zhaoxia Deng, Lunkai Zhang, Nikita Mishra, Henry Hoffmann, and Frederic T Chong. Memory cocktail therapy: A general learning-based framework to optimize dynamic tradeoffs in nvms. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 232–244, 2017.

[36] Engin Ïpek, Sally A McKee, Rich Caruana, Bronis R de Supinski, and Martin Schulz. Efficiently exploring architectural design spaces via predictive modeling. ACM SIGOPS Operating Systems Review, 40(5):195–206, 2006.

[37] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.

[38] Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. In 11th USENIX symposium on operating systems design and implementation (OSDI 14), pages 285–300, 2014.

[39] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. Ix: A protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65, 2014. doi: 10.1145/2997641

[40] Shaohong Li, Xi Wang, Faria Kalim, Xiao Zhang, Sangeetha Abdu Jyothi, Karan Grover, Vasileios Kontorinis, Nina Narodytska, Owolabi Legunsen, Sreekumar Kodakara, et al. Thunderbolt: throughput-optimized, quality-of-service-aware power capping at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 1241–1255, 2020.

[41] Benjamin C Lee and David M Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. ACM SIGOPS operating systems review, 40(5):185–194, 2006.

[42] Wanghong Yuan and Klara Nahrstedt. Energy-efficient soft real-time cpu scheduling for mobile multimedia systems. ACM SIGOPS Operating Systems Review, 37(5):149–163, 2003.

[43] Engin Ipek, Bronis R De Supinski, Martin Schulz, and Sally A McKee. An approach to performance prediction for parallel applications. In European Conference on Parallel Processing, pages 196–205. Springer, 2005.

[44] Eshan Bhatia, Gino Chacon, Seth Pugsley, Elvira Teran, Paul V Gratz, and Daniel A Jiménez. Perceptron-based prefetch filtering. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pages 1–13. IEEE, 2019.

[45] Elba Garza, Samira Mirbagher-Ajorpaz, Tahsin Ahmad Khan, and Daniel A Jiménez. Bit-level perceptron prediction for indirect branches. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pages 27–38. IEEE, 2019.

[46] Zhan Shi, Xiangru Huang, Akanksha Jain, and Calvin Lin. Applying deep learning to the cache replacement problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 413–425, 2019.

[47] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM special interest group on data communication, pages 270–288. 2019.

[48] Christina Delimitrou and Christos Kozyrakis. Paragon: Qos-aware scheduling for heterogeneous datacenters. ACM SIGPLAN Notices, 48(4):77–88, 2013.

[49] Nikita Mishra, Connor Imes, John D Lafferty, and Henry Hoffmann. Caloree: Learning control for predictable latency and low energy. ACM SIGPLAN Notices, 53(2):184–198, 2018.

[50] Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. Leveraging deep learning to improve performance predictability in cloud microservices with seer. ACM SIGOPS Operating Systems Review, 53(1):34–39, 2019.

[51] Yi Ding, Nikita Mishra, and Henry Hoffmann. Generative and multi-phase learning for computer systems optimization. In Proceedings of the 46th International Symposium on Computer Architecture, pages 39–52, 2019.