arxiv: 2512.07750 · v2 · submitted 2025-12-08 · 💻 cs.DC

A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator

Roozbeh Bostandoost , Pooria Namyar , Siva Kesava Reddy Kakarla , Ryan Beckett , Santiago Segarra , Eli Cortez , Ankur Mallick , Kevin Hsieh

show 3 more authors

Rodrigo Fonseca Mohammad Hajiesmaili Behnaz Arzani

This is my paper

Pith reviewed 2026-05-17 00:07 UTC · model grok-4.3

classification 💻 cs.DC

keywords cloud VM allocationML model interactionsbi-level optimizationdistributional uncertaintyperformance analysisadversarial testingproduction traces

0 comments

The pith

SANJESH uses bi-level optimization to expose how multiple ML predictors in VM allocation can compound into four times worse performance than existing tests find.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SANJESH as a way to analyze interactions among several machine learning models that jointly decide VM placement in a public cloud. These models predict CPU demand, memory use, and instance lifetime, and their outputs together control how many servers are provisioned and how often live migrations occur. Because the models are probabilistic, small shifts in their output distributions can align in ways that degrade overall system metrics far beyond what any single model would suggest. SANJESH casts the problem as a bi-level optimization: an outer loop searches for plausible adversarial distributions, while an inner loop measures the allocator's resulting performance. Applied to the operator's real traces, the method located failure cases that the provider's own evaluator had missed, producing up to fourfold increases in cost-related metrics.

Core claim

SANJESH formulates a bi-level optimization in which the outer level searches over distributional uncertainty in the ML models' predictions to locate adversarial input conditions, while the inner level evaluates the VM allocator's performance given those predictions, revealing scenarios in production traces that cause four times worse performance than the operator's existing evaluator detected.

What carries the argument

Bi-level optimization separating an outer search over possible ML prediction distributions under uncertainty from an inner performance evaluation of the VM allocator.

If this is right

Cloud operators gain a concrete method to surface hidden performance risks before new ML models are rolled into allocation pipelines.
The approach can be reused on any multi-model decision system where individual predictors feed a downstream optimizer.
It quantifies the gap between single-model validation and joint worst-case behavior under realistic distributional shifts.
Findings indicate that deterministic adversarial testing is insufficient for systems whose decisions depend on correlated probabilistic outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar bi-level stress testing could be applied to other production ML pipelines such as auto-scaling or network routing where multiple predictors interact.
The results suggest that retraining individual models with explicit awareness of downstream allocation effects may be necessary to close the observed performance gap.
If the uncovered scenarios appear in live traffic, operators would face sustained higher server counts or migration overhead until the models are jointly recalibrated.

Load-bearing premise

The outer-level search over distributional uncertainty accurately represents the range of real-world prediction shifts that the production models can experience.

What would settle it

Re-running the operator's standard evaluator on the same traces while forcing the exact prediction distributions found by SANJESH should still produce only the milder degradation originally reported.

Figures

Figures reproduced from arXiv: 2512.07750 by Ankur Mallick, Behnaz Arzani, Eli Cortez, Kevin Hsieh, Mohammad Hajiesmaili, Pooria Namyar, Rodrigo Fonseca, Roozbeh Bostandoost, Ryan Beckett, Santiago Segarra, Siva Kesava Reddy Kakarla.

**Figure 1.** Figure 1: The CPU model has the most impact on the number of live-migrations. We engineer each of the ML models to have 70% accuracy and only use predictions from one of the models (we use the ground truth for the rest) in each experiment. The y-axis shows how the models change the risk of live-migrations relative to the memory model. Operators initially used cross-validation, experiments on traces from previous day… view at source ↗

**Figure 2.** Figure 2: SANJESH overview. Users provide the ML models, their feature dependencies, and example test data, along with the specific queries they want SANJESH to answer ( [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of how we model the mechanisms in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Here, when the feature 𝑓2 is large the risk surface (shaded region) is a smaller portion of the feature space (potentially, less likely) compared to when 𝑓2 is small (left); we look at the projection of (𝑓1.𝑓3) on the 𝑓2 plane (right). We split the 𝑁 VMs we want to analyze into 𝑘 partitions based on their arrival time. We solve the bi-level optimization for each partition separately. When we find a soluti… view at source ↗

**Figure 5.** Figure 5: An example of how we use the CEGAR-based approach. We consider a simple multi-class LGBM classifier where the model has one tree per class and takes three features (𝑓1, 𝑓2, 𝑓3) as input. To check if there exists an input feature vector that causes the model to predict CLASS 2 we apply a CEGAR strategy as follows: we cut the trees to depth 1 and approximate the parts we pruned with the minimum leaf value in… view at source ↗

**Figure 6.** Figure 6: SANJESH vs simulation. (left) In under 3 hours, SANJESH finds worse performance scenarios than simulations on the same-length traces. After 9 hours, it finds a case with 4× more degradation. (middle) Across runs, SANJESH consistently uncovers higher migration risk than simulations. (right) SANJESH finds scenarios with 8× more excess servers per epoch. S S𝑀 S𝐶 S𝐿 0 0.5 1 Risk of Migration [PITH_FULL_IMAGE:… view at source ↗

**Figure 7.** Figure 7: Impact of each model on the risk of migration. The CPU model contributes the most, driving up to 6× higher migration risk compared to the memory model. The lifetime model has no impact when the others are accurate. S in the x-axis is SANJESH. the performance of the allocator across realistic workloads and answers operator-driven queries and (2) how SANJESH outperforms simulation-based approaches. 7.1 Metho… view at source ↗

**Figure 8.** Figure 8: 𝐷𝐶 and 𝐷𝐿 denote drift in CPU and Lifetime model data. Introducing 20% and 40% drift causes only a small increase in migration risk. This shows that the VM allocator is robust to moderate drift. SANJESH SANJESHCPUBoost 0 0.5 1 Risk of Migration [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: SANJESH quantifies the improvement we get if we improve the CPU model by 10%. drift for either model. These findings indicate that the VM allocator is robust to moderate data drift. 7.5 SANJESH can analyze a hypothetical model SANJESH showed that the CPU model has the greatest impact on VMA performance ( [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: SANJESH produces a risk surface that aligns with SHAP. (Left) shows feature risk regions across 50+ VMs, (middle) across 100+ VMs, and (right) shows SHAP values. While SHAP ranks features by their influence on model predictions, SANJESH highlights regions (in blue) that cause poor system performance and marks high-entropy, non-actionable features in red. Both methods identify feature A as the most importa… view at source ↗

**Figure 11.** Figure 11: Runtime analysis of the CEGAR-based approach. Each point in the grid indicates whether the CEGAR-based approach has found an example that can produce each of the possible pairs of output labels in those two binary models simultaneously for a given assignment of VM resource size and arrival hour. (left) (1) The CEGAR-based approach progress after 1 minute; (2) after 5 minutes; (right) if we do not use CEGA… view at source ↗

**Figure 12.** Figure 12: LGBM modeling in MetaOpt 101 102 103 0 200 400 600 800 Number of Trees Runtime (seconds) [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗

**Figure 13.** Figure 13: Runtime vs number of trees. ∀𝑖, 𝑗 | 𝑖 ≠ 𝑗 𝑃′ 𝑗 − 𝑃 ′ 𝑖 ≤ 𝑀(1 − 𝑥𝑖) ∑︁ 𝑥𝑖 = 1 (2) Here 𝑀 is the traditional “Big-M” (a large number). The second sum in Equation 2 ensures we only return one class as the output of the LGBM (the optimization is free to break ties in any way it chooses). 16 [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

**Figure 14.** Figure 14: Flow-graph of two models (M1, M2) and their relative difference 𝑅 =M2 −M1. Each node 𝑀 𝑦 𝑥 denotes model 𝑥 predicting with offset 𝑦 ∈ {0, 1, −1} (0=correct, 1=over, -1=under). Nodes 𝑅 𝑦 show output differences. Paths (e.g., Path1–3) are feasible joint outcomes, with flow values encoding the fraction of VMs. This enforces both individual accuracy and relative-output constraints. and vice versa 𝑦3% of the… view at source ↗

read the original abstract

Cloud operators increasingly deploy multiple ML models in their VM allocation pipelines. In such settings, individually benign predictions can shift and compound, severely degrading performance. In a cloud provider's VM placement pipeline, CPU, memory, and lifetime prediction models jointly determine server count, live migration frequency, and network utilization; yet no existing approach can systematically stress-test how these models adversely interact. Deterministic adversarial analyzers cannot capture probabilistic ML behavior, so operators miss failures that arise only from correlated distributional shifts across models In SANJESH, we formulate a bi-level optimization that captures how the ML models behave statistically and uncovers how they adversely interact. The outer level searches over what predictions the ML models could produce under distributional uncertainty to find adversarial conditions; the inner level evaluates how the VM allocator behaves given those predictions. When we applied it to the operator's production traces, SANJESH uncovered scenarios that cause $4\times$ worse performance than the operators' evaluator detected.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SANJESH uses bi-level optimization to hunt for bad interactions among production ML models in VM allocation and reports 4x worse performance on traces, but the uncertainty sets lack calibration details.

read the letter

The main takeaway is that this paper introduces SANJESH, a bi-level optimizer where the outer loop searches over possible prediction distributions under uncertainty and the inner loop runs the VM allocator to measure impact. It targets the practical case where CPU, memory, and lifetime models each look reasonable on their own but can compound into higher server counts, more migrations, and extra network load when their errors correlate.

Referee Report

1 major / 1 minor

Summary. The paper introduces SANJESH, a bi-level optimization framework for stress-testing interactions among ML models (CPU, memory, and lifetime predictors) in a public cloud VM allocator. The outer level searches over predictions under distributional uncertainty to identify adversarial conditions, while the inner level evaluates allocator performance (server count, migrations, network use). On the operator's production traces, it reports uncovering scenarios with 4× worse performance than the existing evaluator detects.

Significance. If the outer-level uncertainty sets prove calibrated to actual model error distributions, the approach could offer a useful method for exposing compounding failures in ML-augmented cloud systems that deterministic analyzers miss. The bi-level formulation directly models statistical behavior rather than point estimates, which is a conceptual strength for this domain.

major comments (1)

[Abstract] Abstract: The outer-level search is described only at a high level as operating 'under distributional uncertainty,' with no specification of how the uncertainty sets are constructed (moment bounds, support constraints, correlation structure, or empirical calibration to the traces' observed prediction errors). This detail is load-bearing for the 4× claim, because an overly permissive set can generate implausible adversarial predictions unrelated to real ML model shifts.

minor comments (1)

[Abstract] Abstract: Selection criteria for the production traces and the precise definition of the performance metric used to compute the 4× factor are not stated, which limits immediate assessment of the result's robustness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on our manuscript. We address the single major comment below and will incorporate revisions to strengthen the presentation of the outer-level uncertainty sets.

read point-by-point responses

Referee: [Abstract] Abstract: The outer-level search is described only at a high level as operating 'under distributional uncertainty,' with no specification of how the uncertainty sets are constructed (moment bounds, support constraints, correlation structure, or empirical calibration to the traces' observed prediction errors). This detail is load-bearing for the 4× claim, because an overly permissive set can generate implausible adversarial predictions unrelated to real ML model shifts.

Authors: We agree that the abstract presents the outer-level formulation at a high level and that additional detail on uncertainty-set construction would better support the 4× performance claim. The body of the manuscript (Section 3.2) defines the sets via an empirical Wasserstein ball whose radius is set to the 95th-percentile joint prediction error observed on the production traces; this choice encodes moment bounds through the empirical first- and second-order moments, support constraints from the observed min/max ranges, and correlation structure via the sample covariance of the CPU/memory/lifetime error vectors. We will revise the abstract to include a concise clause stating that the uncertainty sets are calibrated to the empirical error distribution of the production traces. This change will make explicit that the adversarial predictions remain within the statistical envelope of the deployed models rather than arbitrary shifts. revision: yes

Circularity Check

0 steps flagged

No circularity in bi-level optimization; inner evaluation independent of outer search

full rationale

The paper formulates SANJESH as a bi-level optimization in which the outer level searches over ML model predictions under distributional uncertainty to identify adversarial conditions, while the inner level directly evaluates the VM allocator's performance (server count, migration frequency, network utilization) given those predictions. The reported 4× performance degradation is measured by the inner allocator on production traces rather than being defined or fitted from the outer-level uncertainty sets themselves. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the uncertainty sets serve as exogenous inputs to the allocator model, keeping the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the distributional uncertainty model and the performance metric are referenced but not formalized.

pith-pipeline@v0.9.0 · 5504 in / 1037 out tokens · 37685 ms · 2026-05-17T00:07:15.910135+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

bi-level optimization that captures how the ML models behave statistically... outer level searches over what predictions the ML models could produce under distributional uncertainty
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

time-partitioning technique... bounded model checking and the small-scope hypothesis

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

[1]

Marcos K Aguilera, Jeffrey C Mogul, Janet L Wiener, Patrick Reynolds, and Athicha Muthitacharoen. 2003. Performance debugging for dis- tributed systems of black boxes.ACM SIGOPS Operating Systems Review37, 5 (2003), 74–89

work page 2003
[2]

Guy Amir, Davide Corsi, Raz Yerushalmi, Luca Marzari, David Harel, Alessandro Farinelli, and Guy Katz. 2023. Verifying learning-based robotic navigation systems. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 607–627

work page 2023
[3]

Guy Amir, Michael Schapira, and Guy Katz. 2021. Towards scalable verification of deep reinforcement learning. In2021 formal methods in computer aided design (FMCAD). IEEE, 193–203

work page 2021
[4]

Daniel W Apley and Jingyu Zhu. 2020. Visualizing the effects of predictor variables in black box supervised learning models.Journal of the Royal Statistical Society Series B: Statistical Methodology82, 4 (2020), 1059–1086

work page 2020
[5]

Mina Tahmasbi Arashloo, Ryan Beckett, and Rachit Agarwal. 2023. Formal methods for network performance analysis. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 645–661

work page 2023
[6]

Venkat Arun, Mina Tahmasbi Arashloo, Ahmed Saeed, Mohammad Alizadeh, and Hari Balakrishnan. 2021. Toward formally verifying congestion control behavior(SIGCOMM ’21)

work page 2021
[7]

Behnaz Arzani, Sina Taheri, Pooria Namyar, Ryan Beckett, Siva Ke- sava Reddy Kakarla, and Elnaz Jallilipour. 2025. Raha: A General Tool to Analyze W AN Degradation. InProceedings of the 2025 ACM SIGCOMM Conference

work page 2025
[8]

Hugo Barbalho, Patricia Kovaleski, Beibin Li, Luke Marshall, Marco Molinaro, Abhisek Pan, Eli Cortez, Matheus Leao, Harsh Patwari, Zuzu Tang, et al. 2023. Virtual machine allocation with lifetime predictions. Proceedings of Machine Learning and Systems5 (2023)

work page 2023
[10]

Daniel S Berger. 2018. Towards lightweight and robust machine learn- ing for CDN caching. InProceedings of the 17th ACM Workshop on Hot Topics in Networks. 134–140

work page 2018
[11]

1997.Introduction to linear optimization

Dimitris Bertsimas and John N Tsitsiklis. 1997.Introduction to linear optimization. V ol. 6. Athena scientific Belmont, MA

work page 1997
[12]

Mark Brunelli. 2015. Downtime costs small businesses up to $427 per minute.https://www .carbonite.com/blog/2015/downtime-costs- small-businesses-up-to-427-per-minute/

work page 2015
[13]

2024.Statistical inference

George Casella and Roger Berger. 2024.Statistical inference. CRC press

work page 2024
[14]

Hongge Chen, Huan Zhang, Duane Boning, and Cho-Jui Hsieh. 2019. Robust Decision Trees Against Adversarial Examples. InProceedings of the 36th International Conference on Machine Learning. 1122– 1131

work page 2019
[15]

Hongge Chen, Huan Zhang, Si Si, Yang Li, Duane Boning, and Cho-Jui Hsieh. 2019. Robustness verification of tree-based models.Advances in Neural Information Processing Systems32 (2019)

work page 2019
[16]

Edmund Clarke, Orna Grumberg, Somesh Jha, Yuan Lu, and Helmut Veith. 2000. Counterexample-guided abstraction refinement. InCom- puter Aided Verification: 12th International Conference, CAV 2000, Chicago, IL, USA, July 15-19, 2000. Proceedings 12. Springer, 154– 169

work page 2000
[17]

Clarke, Armin Biere, Richard Raimi, and Yunshan Zhu

Edmund M. Clarke, Armin Biere, Richard Raimi, and Yunshan Zhu

work page
[18]

doi:10 .1023/A: 1011254632723

Bounded Model Checking Using Satisfiability Solving.For- mal Methods in System Design19, 1 (2001), 7–34. doi:10 .1023/A: 1011254632723

work page 2001
[19]

Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Mar- cus Fontoura, and Ricardo Bianchini. 2017. Resource central: Under- standing and predicting workloads for improved resource management in large cloud platforms. InProceedings of the 26th Symposium on Operating Systems Principles. 153–167

work page 2017
[20]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

work page 2008
[21]

Laurens Devos, Wannes Meert, and Jesse Davis. 2021. Versatile ver- ification of tree ensembles. InInternational Conference on Machine Learning. PMLR, 2654–2664

work page 2021
[22]

Gil Einziger, Maayan Goldstein, Yaniv Sa’ar, and Itai Segall. 2019. Verifying robustness of gradient boosted models. InProceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty- First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelli- gence

work page 2019
[23]

Tomer Eliyahu, Yafim Kazak, Guy Katz, and Michael Schapira. 2021. Verifying learning-augmented systems. InProceedings of the 2021 ACM SIGCOMM 2021 Conference. 305–318

work page 2021
[24]

Matteo Fischetti and Jason Jo. 2018. Deep neural networks and mixed integer linear optimization.Constraints23, 3 (2018), 296–309

work page 2018
[25]

Rodrigo Fonseca, George Porter, Randy H Katz, and Scott Shenker

work page
[26]

In4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07)

{X-Trace}: A pervasive network tracing framework. In4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07)

work page
[27]

Saksham Goel, Benjamin Mikek, Jehad Aly, Venkat Arun, Ahmed Saeed, and Aditya Akella. 2023. Quantitative verification of scheduling heuristics.arXiv preprint arXiv:2301.04205(2023)

work page arXiv 2023
[28]

Lexiang Huang and Timothy Zhu. 2021. tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces. InProceedings of the ACM Symposium on Cloud Computing. 76–91

work page 2021
[29]

Minhao Jin and Maria Apostolaki. 2024. Robustifying ML-powered Network Classifiers with PANTS.arXiv preprint arXiv:2409.04691 (2024)

work page arXiv 2024
[30]

Siva Kesava Reddy Kakarla, Ryan Beckett, Behnaz Arzani, Todd Mill- stein, and George Varghese. 2020. Groot: Proactive verification of dns configurations. InProceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 310–328

work page 2020
[31]

Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O’Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, et al. 2017. Canopy: An end-to-end performance tracing and analysis system. InProceedings of the 26th symposium on operating systems principles. 34–50

work page 2017
[32]

Alex Kantchelian, J. D. Tygar, and Anthony Joseph. 2016. Evasion and Hardening of Tree Ensemble Classifiers. InProceedings of The 33rd International Conference on Machine Learning

work page 2016
[33]

Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. 2022. Reluplex: a calculus for reasoning about deep neural networks.Formal Methods in System Design60, 1 (2022), 87–116

work page 2022
[34]

Guy Katz, Derek A Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Alek- sandar Zelji´c, et al. 2019. The marabou framework for verification and analysis of deep neural networks. InComputer Aided Verification: 31st International Conference, CAV 2019, New York City, NY, USA, July 15-18, 2019, Pro...

work page 2019
[35]

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Wei- dong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. InAdvances in neural informa- tion processing systems, V ol. 30. Curran Associates, Inc., 3146–3154

work page 2017
[36]

James C King. 1976. Symbolic execution and program testing.Com- mun. ACM19, 7 (1976), 385–394. 12

work page 1976
[37]

Arnd Christian König, Yi Shan, Karan Newatia, Luke Marshall, and Vivek Narasayya. 2023. Solver-In-The-Loop Cluster Resource Manage- ment for Database-as-a-Service.Proceedings of the VLDB Endowment 16, 13 (2023), 4254–4267

work page 2023
[38]

Sang Gyu Kwak and Jong Hae Kim. 2017. Central limit theorem: the cornerstone of modern statistics.Korean journal of anesthesiology70, 2 (2017), 144

work page 2017
[39]

Alessio Lomuscio and Lalit Maganti. 2017. An approach to reacha- bility analysis for feed-forward relu neural networks.arXiv preprint arXiv:1706.07351(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Scott Lundberg. 2017. A unified approach to interpreting model pre- dictions.arXiv preprint arXiv:1705.07874(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. 2017. Neural adaptive video streaming with pensieve. InProceedings of the con- ference of the ACM special interest group on data communication. 197–210

work page 2017
[42]

Luke Merrick and Ankur Taly. 2020. The explanation game: Explaining machine learning models using shapley values. InMachine Learning and Knowledge Extraction: 4th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2020, Dublin, Ireland, August 25–28, 2020, Proceedings 4. Springer, 17–38

work page 2020
[43]

2020.Interpretable machine learning

Christoph Molnar. 2020.Interpretable machine learning. Lulu. com

work page 2020
[44]

Pooria Namyar, Behnaz Arzani, Ryan Beckett, Santiago Segarra, Hi- manshu Raj, and Srikanth Kandula. 2022. Minding the gap between fast heuristics and their optimal counterparts. InProceedings of the 21st ACM Workshop on Hot Topics in Networks (HotNets ’22)

work page 2022
[45]

Pooria Namyar, Behnaz Arzani, Ryan Beckett, Santiago Segarra, Hi- manshu Raj, Umesh Krishnaswamy, Ramesh Govindan, and Srikanth Kandula. 2024. Finding adversarial inputs for heuristics using multi- level optimization. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 927–949

work page 2024
[46]

Pooria Namyar, Michael Schapira, Ramesh Govindan, Santiago Segarra, Ryan Beckett, Siva Kesava Reddy Kakarla, and Behnaz Arzani

work page
[47]

InProceedings of the 23rd ACM Workshop on Hot Topics in Networks

End-to-End Performance Analysis of Learning-enabled Systems. InProceedings of the 23rd ACM Workshop on Hot Topics in Networks. 86–94

work page
[48]

Besmira Nushi, Ece Kamar, and Eric Horvitz. 2018. Towards account- able AI: Hybrid human-machine analyses for characterizing system failure. InProceedings of the AAAI Conference on Human Computation and Crowdsourcing, V ol. 6. 126–135

work page 2018
[49]

Maddy Osman. [n. d.]. How to calculate the true cost of ecommerce downtime and reduce its impact.https://www .nexcess.net/blog/ ecommerce-downtime/

work page
[50]

Joo Pedro Pedroso. 2011. Optimization with gurobi and python.INESC Porto and Universidade do Porto„ Porto, Portugal1 (2011)

work page 2011
[51]

Yarin Perry, Felipe Vieira Frujeri, Chaim Hoch, Srikanth Kandula, Ishai Menache, Michael Schapira, and Aviv Tamar. 2023. {DOTE}: Rethinking (Predictive){W AN} Traffic Engineering. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 1557–1581

work page 2023
[52]

P Jonathon Phillips, P Jonathon Phillips, Carina A Hahn, Peter C Fontana, Amy N Yates, Kristen Greene, David A Broniatowski, and Mark A Przybocki. 2021. Four principles of explainable artificial intelligence. (2021)

work page 2021
[53]

Adam Ruprecht, Danny Jones, Dmitry Shiraev, Greg Harmon, Maya Spivak, Michael Krebs, Miche Baker-Harvey, and Tyler Sanderson

work page
[54]

Vm live migration at scale.ACM SIGPLAN Notices53, 3 (2018), 45–56

work page 2018
[55]

Naoto Sato, Hironobu Kuruma, Yuichiroh Nakagawa, and Hideto Ogawa. 2019. Formal verification of decision-tree ensemble model and detection of its violating-input-value ranges.arXiv preprint arXiv:1904.11753(2019)

work page arXiv 2019
[56]

Benjamin H Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chan- dan Shanbhag. 2010. Dapper, a large-scale distributed systems tracing infrastructure. (2010)

work page 2010
[57]

Zhenyu Song, Daniel S Berger, Kai Li, Anees Shaikh, Wyatt Lloyd, Soudeh Ghorbani, Changhoon Kim, Aditya Akella, Arvind Krishna- murthy, Emmett Witchel, et al . 2020. Learning relaxed belady for content distribution network caching. In17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 529–544

work page 2020
[58]

Ryuichi Takanobu, Qi Zhu, Jinchao Li, Baolin Peng, Jianfeng Gao, and Minlie Huang. 2020. Is Your Goal-Oriented Dialog Model Perform- ing Really Well? Empirical Analysis of System-wise Evaluation. In 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 297

work page 2020
[59]

Pingdom Team. 2023. Average Cost of Downtime per In- dustry.https://www .pingdom.com/outages/average-cost-of- downtime-per-industry/

work page 2023
[60]

Vincent Tjeng, Kai Xiao, and Russ Tedrake. 2017. Evaluating ro- bustness of neural networks with mixed integer programming.arXiv preprint arXiv:1711.07356(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[61]

John Törnblom and Simin Nadjm-Tehrani. 2019. An abstraction- refinement approach to formal verification of tree ensembles. InCom- puter Safety, Reliability, and Security: SAFECOMP 2019 Workshops, ASSURE, DECSoS, SASSUR, STRIVE, and WAISE, Turku, Finland, September 10, 2019, Proceedings 38. Springer, 301–313

work page 2019
[62]

John Törnblom and Simin Nadjm-Tehrani. 2020. Formal verification of input-output mappings of tree ensembles.Science of Computer Programming194 (2020), 102450

work page 2020
[63]

Yihan Wang, Huan Zhang, Hongge Chen, Duane Boning, and Cho-Jui Hsieh. 2020. On lp-norm robustness of ensemble decision stumps and trees. InProceedings of the 37th International Conference on Machine Learning

work page 2020
[64]

Doris Xin, Hui Miao, Aditya Parameswaran, and Neoklis Polyzotis

work page
[65]

InProceedings of the 2021 international conference on management of data

Production machine learning pipelines: Empirical analysis and optimization opportunities. InProceedings of the 2021 international conference on management of data. 2639–2652

work page 2021
[66]

adversarial feature space

Chong Zhang, Huan Zhang, and Cho-Jui Hsieh. 2020. An efficient adversarial attack for tree ensembles.Advances in neural information processing systems33 (2020), 16165–16176. 13 APPENDIX A SANJESHsupported queries Table 4 summarizes the queries that SANJESHsupports. B Probabilities on finite samples In §5 we described how many queries in SANJESHrequire we ...

work page 2020