Cost-aware Duration Prediction for Software Upgrades in Datacenters
Pith reviewed 2026-05-24 09:56 UTC · model grok-4.3
The pith
Acela predicts software upgrade durations in datacenters while accounting for asymmetric misprediction costs to raise scheduling efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a duration prediction system which explicitly models asymmetric costs of prediction errors, selects appropriate models, and mitigates straggler-induced overestimations can raise upgrade window utilization by a factor of 1.25, increase the number of scheduled upgrades by 33 percent and completed upgrades by 41 percent, and cut cancellation rates by a factor of 2.4 when evaluated on Meta's production datacenter systems.
What carries the argument
Acela, the cost-aware duration prediction framework that strategically selects predictive models based on misprediction costs and adjusts for stragglers.
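A minimal sketch of cost-aware model selection, assuming a piecewise-linear penalty in which underestimates cost more than overestimates. The function names, the 3:1 cost ratio, and the two toy predictors are hypothetical. This summary does not state Acela's actual cost function or model pool.

```python
from typing import Callable, Dict, Sequence


def asymmetric_cost(predicted: float, actual: float,
                    under_cost: float = 3.0, over_cost: float = 1.0) -> float:
    """Piecewise-linear misprediction cost (assumed form, not taken from the paper)."""
    error = actual - predicted
    if error > 0:
        # Underestimate: the upgrade ran longer than predicted.
        return under_cost * error
    # Overestimate (or exact): window time was reserved but not used.
    return over_cost * (-error)


def select_model(models: Dict[str, Callable[[object], float]],
                 features: Sequence[object],
                 actuals: Sequence[float],
                 under_cost: float = 3.0,
                 over_cost: float = 1.0) -> str:
    """Return the name of the model with the lowest total asymmetric cost."""
    def total_cost(predict: Callable[[object], float]) -> float:
        return sum(asymmetric_cost(predict(x), y, under_cost, over_cost)
                   for x, y in zip(features, actuals))
    return min(models, key=lambda name: total_cost(models[name]))


# Toy validation set: durations in minutes, two constant predictors.
features = [None, None, None]
actuals = [20.0, 30.0, 50.0]
models = {"mean": lambda x: 30.0, "p90": lambda x: 45.0}

print(select_model(models, features, actuals))            # p90
print(select_model(models, features, actuals, 1.0, 1.0))  # mean
```

With symmetric costs the central predictor wins. Once underestimates are weighted three times as heavily, the more conservative predictor is selected, which is the behavior a cost-aware selector is meant to capture.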
If this is right
- More upgrades can be completed without expanding maintenance windows.
- Upgrade cancellations due to time overruns decrease substantially.
- Service level objectives are met more reliably during upgrade periods.
- The overall throughput of the upgrade scheduler increases.
Where Pith is reading between the lines
- The same cost-aware selection logic might improve duration predictions for other datacenter tasks such as data migrations or hardware servicing.
- If the framework generalizes, operators could apply it to new upgrade types with minimal additional tuning.
- Datacenters with more heterogeneous hardware might require extensions to handle a wider range of straggler behaviors.
Load-bearing premise
That the upgrade characteristics and workloads observed in Meta's datacenters are representative enough that models trained there will achieve similar gains in other environments.
What would settle it
Deploying Acela on upgrade logs from a second large-scale datacenter operator and measuring whether the reported improvements in utilization, throughput, and cancellation rates still appear.
Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Acela, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Acela accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Acela significantly increases efficiency of the existing upgrade scheduler by improving upgrade window utilization by 1.25X, increasing the number of scheduled and completed upgrades by 33% and 41%, and reducing cancellation rates by 2.4X. The code and data sets will be released after paper acceptance.
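To illustrate why the direction of a misprediction matters for window utilization and cancellations, here is a toy sketch. It assumes upgrades run sequentially inside one fixed maintenance window, are admitted greedily by shortest predicted duration, and are cancelled if they would overrun the window. These rules and the names `schedule` and `run` are illustrative assumptions, not the scheduler described in the paper.

```python
from typing import List, Tuple


def schedule(predicted: List[float], window: float) -> List[int]:
    """Greedily admit upgrades (shortest predicted first) while they fit the window."""
    chosen, used = [], 0.0
    for i in sorted(range(len(predicted)), key=lambda i: predicted[i]):
        if used + predicted[i] <= window:
            chosen.append(i)
            used += predicted[i]
    return chosen


def run(chosen: List[int], actual: List[float], window: float) -> Tuple[int, int, float]:
    """Execute admitted upgrades in order; cancel any that would overrun the window.

    Returns (completed, cancelled, window utilization).
    """
    elapsed, completed, cancelled = 0.0, 0, 0
    for i in chosen:
        if elapsed + actual[i] <= window:
            elapsed += actual[i]
            completed += 1
        else:
            cancelled += 1
    return completed, cancelled, elapsed / window


actual = [10.0, 20.0, 30.0, 40.0]   # true durations in minutes (made-up values)
window = 60.0

exact = actual
over = [1.5 * d for d in actual]    # every duration overestimated by 50%
under = [0.5 * d for d in actual]   # every duration underestimated by 50%

print(run(schedule(exact, window), actual, window))  # (3, 0, 1.0)
print(run(schedule(over, window), actual, window))   # (2, 0, 0.5)
print(run(schedule(under, window), actual, window))  # (3, 1, 1.0)
```

In this toy setting, overestimation leaves half the window idle, while underestimation admits an upgrade that is later cancelled. The two error directions hurt different metrics, which is why a single symmetric error measure can be a poor proxy for scheduling quality.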
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript characterizes software upgrade types in datacenters and frames upgrade scheduling as a constrained optimization problem. It introduces Acela, a cost-aware duration prediction framework that accounts for asymmetric misprediction costs, strategically selects predictive models, and mitigates straggler-induced overestimations while meeting SLOs. Production evaluations at Meta report that Acela raises the existing scheduler's upgrade window utilization by 1.25X, increases scheduled and completed upgrades by 33% and 41%, respectively, and reduces cancellation rates by 2.4X. Code and datasets are promised for release.
Significance. If the empirical gains hold under broader conditions, the work addresses an under-explored scheduling domain with direct operational relevance for large-scale datacenters. The explicit handling of cost asymmetry and stragglers, combined with the commitment to release artifacts, would strengthen reproducibility and enable follow-on studies.
major comments (1)
- [Evaluation] Evaluation section: The reported gains (1.25X window utilization, +33%/+41% scheduled/completed upgrades, 2.4X lower cancellations) rest exclusively on a single Meta production trace. No cross-site experiments, synthetic workloads, or sensitivity analysis varying upgrade-duration distributions, straggler frequency, or misprediction-cost ratios are presented, so it is unclear whether the constrained optimizer would deliver comparable throughput under different workload traits.
Simulated Author's Rebuttal
We thank the referee for the feedback. The evaluation concern is valid and we address it directly below.
- Referee: Evaluation section: The reported gains (1.25X window utilization, +33%/+41% scheduled/completed upgrades, 2.4X lower cancellations) rest exclusively on a single Meta production trace. No cross-site experiments, synthetic workloads, or sensitivity analysis varying upgrade-duration distributions, straggler frequency, or misprediction-cost ratios are presented, so it is unclear whether the constrained optimizer would deliver comparable throughput under different workload traits.
Authors: We agree that the evaluation uses a single production trace from Meta. This trace captures the actual upgrade types, straggler patterns, and cost asymmetries observed in a large-scale operational setting, which is the intended deployment environment. Cross-site experiments are not possible because comparable traces from other operators are unavailable due to proprietary constraints. We will, however, add a sensitivity analysis in the revision: using the promised dataset release, we will perturb upgrade-duration distributions, straggler rates, and misprediction-cost ratios and re-run the constrained optimizer to quantify throughput variation. This addresses the robustness question while remaining within the scope of the released artifacts.
Revision: partial
- Cross-site experiments cannot be conducted because production traces from other datacenter operators are not accessible.
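The sensitivity analysis proposed in the rebuttal could take roughly the following shape. This is a sketch on synthetic data only: the Gaussian base durations, the 4x straggler slowdown, and the helper names are all assumptions, and a fixed quantile of past durations stands in for the predictor. The quantile doubles as a proxy for the misprediction-cost ratio, since under pinball loss the q-quantile is optimal when underestimates cost q/(1-q) times as much as overestimates.

```python
import random
from typing import Dict, List


def synth_durations(n: int, straggler_rate: float, rng: random.Random,
                    base: float = 30.0, spread: float = 5.0,
                    slowdown: float = 4.0) -> List[float]:
    """Synthetic upgrade durations; a fraction are stragglers slowed by `slowdown`."""
    durations = []
    for _ in range(n):
        d = max(1.0, rng.gauss(base, spread))
        if rng.random() < straggler_rate:
            d *= slowdown
        durations.append(d)
    return durations


def quantile(values: List[float], q: float) -> float:
    """Simple empirical quantile by index into the sorted values."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def sweep(straggler_rates: List[float], quantiles: List[float],
          n: int = 2000, seed: int = 0) -> List[Dict[str, float]]:
    """For each (straggler rate, quantile) pair, report overrun rate and mean slack."""
    rows = []
    for rate in straggler_rates:
        rng = random.Random(seed)
        train = synth_durations(n, rate, rng)
        test = synth_durations(n, rate, rng)
        for q in quantiles:
            pred = quantile(train, q)
            overrun = sum(d > pred for d in test) / len(test)
            slack = sum(max(pred - d, 0.0) for d in test) / len(test)
            rows.append({"straggler_rate": rate, "quantile": q,
                         "overrun_rate": overrun, "mean_slack": slack})
    return rows


for row in sweep([0.0, 0.05, 0.2], [0.5, 0.75, 0.9]):
    print(row)
```

Each row shows the trade-off at one operating point: a higher quantile lowers the overrun rate but reserves more unused window time. Replacing the synthetic generator with the released trace, once available, would turn this into the analysis the authors describe.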
Circularity Check
No circularity; empirical evaluation on production traces
The paper frames upgrade scheduling as a constrained optimization problem and introduces Acela as a cost-aware prediction framework, but all load-bearing claims are empirical measurements (1.25X utilization, +33%/+41% scheduled/completed upgrades, 2.4X lower cancellations) obtained by running the system on Meta production traces. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the reported gains are direct outcomes of deployment rather than reductions of a derivation to its own inputs.