Feasible-First Exploration for Constrained ML Deployment Optimization in Crash-Prone Hierarchical Search Spaces
Pith reviewed 2026-05-08 03:40 UTC · model grok-4.3
The pith
Thermal Budget Annealing maps feasible regions first to cut wasted trials when optimizing ML deployments in crash-prone hierarchical spaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that in hierarchical mixed-variable deployment spaces where valid configurations are rare, an explicit feasible-first exploration stage called Thermal Budget Annealing outperforms cold-start TPE. The stage first maps valid and feasible regions and then warm-starts the optimizer. Two added mechanisms support it: trial timeouts that abort infeasible runs early, and subspace blacklisting that suppresses repeatedly failing categorical subspaces.
What carries the argument
Thermal Budget Annealing (TBA), a feasible-first exploration procedure that maps valid and feasible regions before warm-starting TPE, using trial timeouts and subspace blacklisting for hostile hardware.
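The two-stage shape described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy `evaluate` function, the family names, and the score weights are hypothetical, and the second stage uses simple local perturbation as a stand-in where a real system would warm-start TPE with the feasible points.

```python
import random

FAMILIES = ["small", "medium", "large"]            # hypothetical model families
WEIGHTS = {"small": 1.0, "medium": 1.5, "large": 2.0}  # hypothetical score weights

def evaluate(config):
    """Toy stand-in for a deployment trial: returns (status, score)."""
    family, batch = config
    if family == "large" and batch > 8:
        return "crash", None        # hidden crash zone
    if batch > 32:
        return "infeasible", None   # violates a latency or memory limit
    return "ok", WEIGHTS[family] * batch

def explore(budget, rng):
    """Stage 1: sample broadly and keep only trials that ran and met constraints."""
    feasible = []
    for _ in range(budget):
        config = (rng.choice(FAMILIES), rng.randint(1, 64))
        status, score = evaluate(config)
        if status == "ok":
            feasible.append((config, score))
    return feasible

def exploit(seeds, budget, rng):
    """Stage 2 stand-in: perturb the best feasible seed (assumption: replaces TPE)."""
    if not seeds:
        return None, None
    best_config, best_score = max(seeds, key=lambda s: s[1])
    for _ in range(budget):
        family, batch = best_config
        candidate = (family, max(1, min(64, batch + rng.randint(-4, 4))))
        status, score = evaluate(candidate)
        if status == "ok" and score > best_score:
            best_config, best_score = candidate, score
    return best_config, best_score

rng = random.Random(0)
seeds = explore(50, rng)
best_config, best_score = exploit(seeds, 20, rng)
print(len(seeds), best_config, best_score)
```

The point of the split is that the second stage only ever starts from configurations already known to run, so its budget is not spent rediscovering the crash zone.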
If this is right
- A larger share of the evaluation budget reaches valid model-family configurations under tight latency and memory limits.
- Early timeouts and blacklisting reduce the cost of each wasted trial in spaces with hidden crash zones.
- The approach applies directly to joint search over model family, quantization scheme, runtime backend, and serving configuration.
- DeployBench supplies a repeatable testbed with hierarchical structure, unequal evaluation costs, and hidden invalid regions for comparing such methods.
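A hierarchical space of the kind listed above (model family, then quantization scheme, runtime backend, and serving configuration) could be represented as follows. This is a sketch under assumptions: the family, quantization, and backend names are illustrative placeholders, not the paper's search space.

```python
import random

# Hypothetical hierarchy: the choices available for quantization and backend
# depend on the model family chosen first. Names are illustrative only.
SPACE = {
    "resnet": {"quant": ["fp16", "int8"], "backend": ["onnxruntime", "tensorrt"]},
    "vit": {"quant": ["fp16"], "backend": ["onnxruntime"]},
}
BATCH_SIZES = [1, 2, 4, 8, 16]  # hypothetical serving configuration values

def sample_config(rng):
    """Sample top-down so child choices are always valid for the parent."""
    family = rng.choice(sorted(SPACE))
    children = SPACE[family]
    return {
        "family": family,
        "quant": rng.choice(children["quant"]),
        "backend": rng.choice(children["backend"]),
        "batch_size": rng.choice(BATCH_SIZES),
    }

def subspace_key(config):
    """Categorical part of a configuration, usable as a blacklist key."""
    return (config["family"], config["quant"], config["backend"])

print(sample_config(random.Random(0)))
```

Sampling top-down rules out structurally impossible combinations, but it does not rule out configurations that crash or violate constraints at run time, which is the regime the paper targets.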
Where Pith is reading between the lines
- The same feasible-first mapping step could transfer to other black-box problems with sparse feasible sets, such as circuit design or molecular property optimization.
- Blacklisting repeated failures in categorical subspaces may generalize to adaptive pruning techniques in broader hierarchical optimization.
- The hybrid structure suggests testing TBA as a front-end for other model-based optimizers beyond TPE in mixed-variable settings.
Load-bearing premise
The premise is that an explicit feasible-first exploration stage followed by warm-started TPE will outperform cold-start TPE, and that it will do so without the mapping phase missing high-value regions or the blacklisting discarding useful subspaces.
What would settle it
Run TBA and cold-start TPE on a synthetic hierarchical space engineered with small, disconnected feasible pockets and measure whether TBA still reduces the fraction of budget spent on invalid trials or instead wastes early evaluations on failed mapping attempts.
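One way to set up such a test space is sketched below. The pocket locations, category names, and the uniform baseline sampler are all hypothetical; the sketch only shows how the wasted-budget fraction could be measured for any sampler, not how TBA or TPE would score on it.

```python
import random

CATEGORIES = ["a", "b", "c", "d"]
# Hypothetical small, disconnected feasible pockets: (category, low, high).
POCKETS = [("a", 0.10, 0.15), ("c", 0.70, 0.74)]

def is_feasible(category, x):
    """A trial is valid only inside one of the pockets for its category."""
    return any(cat == category and lo <= x <= hi for cat, lo, hi in POCKETS)

def uniform_sampler(rng):
    """Baseline that ignores feasibility structure entirely."""
    return rng.choice(CATEGORIES), rng.random()

def wasted_fraction(sampler, budget, rng):
    """Fraction of the budget spent on trials outside every pocket."""
    wasted = sum(not is_feasible(*sampler(rng)) for _ in range(budget))
    return wasted / budget

print(wasted_fraction(uniform_sampler, 1000, random.Random(0)))
```

Any optimizer under test can be wrapped as a `sampler` and compared on the same metric, which makes the comparison in the paragraph above concrete.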
Original abstract
Deploying machine learning models under production constraints requires joint optimization over model family, quantization scheme, runtime backend, and serving configuration. This induces a hierarchical mixed-variable search space in which many configurations are invalid: evaluations may crash, exceed memory limits, or violate latency constraints. Standard black-box optimizers such as Tree-structured Parzen Estimators (TPE) and constrained Bayesian optimization are effective when valid configurations are common, but they can spend a large fraction of a small evaluation budget on invalid or uninformative trials in hostile deployment spaces. This paper studies that regime and asks whether optimization should be decomposed into an explicit exploration stage followed by model-guided exploitation. We propose Thermal Budget Annealing (TBA), a feasible-first exploration procedure that maps valid and feasible regions before warm-starting TPE. The method includes two robustness mechanisms for hostile hardware: trial timeouts that abort clearly infeasible evaluations early, and subspace blacklisting that temporarily suppresses categorical subspaces after repeated failures. We also introduce DeployBench, a benchmark suite for deployment optimization with hierarchical structure, hidden crash zones, hard constraints, and unequal evaluation costs. On synthetic benchmarks and real GPU deployment with five pre-trained vision models across five GPU targets (NVIDIA H100, A100, RTX 5080, L4, and T4), the proposed hybrid improves model-family discovery under tight constraints while reducing wasted budget relative to cold-start TPE.
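The trial-timeout mechanism mentioned in the abstract can be sketched with the standard library. This assumes each evaluation runs in a child process so that a crash or hang cannot take down the optimizer; the function name `run_trial` and the convention that a trial prints its score to stdout are hypothetical, and the paper's actual timeout criteria are not given in this summary.

```python
import subprocess
import sys

def run_trial(code, timeout_s):
    """Run one evaluation in a child process and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "timeout", None
    if proc.returncode != 0:
        return "crash", None
    try:
        return "ok", float(proc.stdout.strip())
    except ValueError:
        return "crash", None  # ran, but produced no parseable score

print(run_trial("print(12.5)", 30.0))
print(run_trial("raise SystemExit(3)", 30.0))
print(run_trial("import time; time.sleep(30)", 0.5))
```

Process isolation is a design choice here: in-process timeouts cannot recover from a segfault or an out-of-memory kill, which are the failure modes the abstract describes.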
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Thermal Budget Annealing (TBA), a feasible-first exploration procedure that first maps valid and feasible regions in crash-prone hierarchical mixed-variable spaces for ML deployment optimization before warm-starting Tree-structured Parzen Estimators (TPE). It incorporates trial timeouts and subspace blacklisting as robustness mechanisms for hostile hardware, introduces the DeployBench benchmark suite with hierarchical structure and hidden crash zones, and reports empirical gains in model-family discovery under tight constraints plus reduced wasted budget versus cold-start TPE on synthetic benchmarks and real GPU tests involving five pre-trained vision models across NVIDIA H100, A100, RTX 5080, L4, and T4 targets.
Significance. If the central empirical claims hold after addressing the noted gaps, the decomposition into explicit feasible-first mapping followed by warm-started exploitation could offer a useful practical strategy for budget-constrained optimization in deployment settings where invalid configurations dominate. The DeployBench suite represents a concrete contribution that may enable standardized evaluation of methods handling unequal evaluation costs and constraint violations.
major comments (2)
- Subspace blacklisting mechanism (Methods): The procedure suppresses categorical subspaces after repeated failures but provides no bound on the false-positive blacklisting rate, no reactivation schedule, and no recovery analysis for subspaces that may contain rare valid configurations (e.g., specific quantization+backend pairs). This is load-bearing for the central claim, as irreversible blacklisting within the evaluation budget would prevent the subsequent warm-started TPE from recovering missed optima and could violate the assumption that the hybrid safely identifies valid regions without discarding useful subspaces.
- Experimental results (Experiments section): The reported improvements on synthetic benchmarks and real GPU deployments lack error bars, ablations isolating the feasible-first mapping stage from the warm-start TPE component, and quantitative details on how the mapping phase allocates budget or avoids missing high-value regions. Without these, the claim of reduced wasted budget relative to cold-start TPE cannot be fully verified and the weakest assumption (that the hybrid does not miss high-value regions) remains untested.
minor comments (2)
- Abstract: The abstract states empirical improvements but does not reference specific tables, figures, or quantitative metrics (e.g., wasted budget fractions or discovery rates); cross-reference these explicitly for clarity.
- Methods (notation and reproducibility): Ensure the full methods section includes pseudocode for TBA, explicit definitions of the hierarchical space variables, and the exact criteria for trial timeouts and blacklisting triggers to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the subspace blacklisting mechanism and the experimental validation that we will strengthen in revision. We respond point by point below.
- Referee: Subspace blacklisting mechanism (Methods): The procedure suppresses categorical subspaces after repeated failures but provides no bound on the false-positive blacklisting rate, no reactivation schedule, and no recovery analysis for subspaces that may contain rare valid configurations (e.g., specific quantization+backend pairs). This is load-bearing for the central claim, as irreversible blacklisting within the evaluation budget would prevent the subsequent warm-started TPE from recovering missed optima and could violate the assumption that the hybrid safely identifies valid regions without discarding useful subspaces.
Authors: We agree that the current Methods description would benefit from greater precision on the blacklisting procedure. The manuscript already characterizes blacklisting as temporary suppression after repeated failures within a subspace, but we will revise the text to explicitly state the reactivation schedule (reactivation occurs after a cooldown interval scaled by the current annealing temperature, allowing re-exploration of previously suppressed subspaces). We will also add an empirical recovery analysis in the Experiments section, reporting the fraction of blacklisted subspaces that are later successfully re-sampled and confirming that high-value feasible configurations are recovered before the TPE exploitation phase. While we do not claim a theoretical bound on the false-positive blacklisting rate (the mechanism is a practical heuristic for hostile hardware rather than a provably safe filter), the combination of temporary suppression, trial timeouts, and subsequent warm-started TPE ensures that the hybrid does not permanently discard useful subspaces. These clarifications and added analyses will directly address the load-bearing concern.
Revision: yes
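The temporary suppression with temperature-scaled cooldown described in this response might look like the sketch below. It is an illustration under assumptions: the class name, the failure threshold, and the choice that the cooldown grows in proportion to the temperature are all hypothetical, since the response does not state the direction or form of the scaling.

```python
from collections import defaultdict

class SubspaceBlacklist:
    """Temporarily suppress categorical subspaces after repeated failures."""

    def __init__(self, max_failures=3, base_cooldown=10):
        self.max_failures = max_failures    # hypothetical trigger threshold
        self.base_cooldown = base_cooldown  # hypothetical cooldown, in trials
        self.failures = defaultdict(int)
        self.blocked_until = {}

    def record(self, key, ok, trial_index, temperature):
        """Update failure counts; block the subspace once the threshold is hit."""
        if ok:
            self.failures[key] = 0
            return
        self.failures[key] += 1
        if self.failures[key] >= self.max_failures:
            # Assumption: cooldown is proportional to the annealing temperature.
            cooldown = max(1, round(self.base_cooldown * temperature))
            self.blocked_until[key] = trial_index + cooldown
            self.failures[key] = 0  # start fresh after reactivation

    def is_blocked(self, key, trial_index):
        """A subspace is reactivated once its cooldown has elapsed."""
        return trial_index < self.blocked_until.get(key, 0)

bl = SubspaceBlacklist(max_failures=2, base_cooldown=10)
key = ("vit", "int8", "tensorrt")  # illustrative subspace key
bl.record(key, ok=False, trial_index=0, temperature=0.5)
bl.record(key, ok=False, trial_index=1, temperature=0.5)
print(bl.is_blocked(key, 2), bl.is_blocked(key, 6))
```

Because every block carries an expiry, no subspace is discarded for the rest of the budget, which is the property the referee's comment asks the authors to demonstrate.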
- Referee: Experimental results (Experiments section): The reported improvements on synthetic benchmarks and real GPU deployments lack error bars, ablations isolating the feasible-first mapping stage from the warm-start TPE component, and quantitative details on how the mapping phase allocates budget or avoids missing high-value regions. Without these, the claim of reduced wasted budget relative to cold-start TPE cannot be fully verified and the weakest assumption (that the hybrid does not miss high-value regions) remains untested.
Authors: We concur that the experimental presentation requires these additions to fully substantiate the claims. In the revised manuscript we will include: error bars on all performance metrics computed over at least five independent random seeds; dedicated ablations that evaluate the feasible-first mapping stage in isolation, the warm-start TPE stage in isolation, and the full hybrid; and quantitative breakdowns of the mapping phase (budget fraction allocated to exploration, per-subspace success rates in identifying feasible regions, and explicit checks that high-value configurations discovered during mapping are retained for the exploitation phase). These revisions will enable direct verification of the reduced wasted-budget claim and confirm that the hybrid does not miss high-value regions.
Revision: yes
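Reporting a metric with error bars over independent seeds, as promised in this response, could be done along these lines. The `toy_wasted_fraction` function and its `feasible_rate` parameter are hypothetical stand-ins for a real experiment run; the numbers it produces are synthetic and say nothing about the paper's results.

```python
import random
import statistics

def toy_wasted_fraction(seed, budget=200, feasible_rate=0.2):
    """Hypothetical stand-in for one experiment run with a given seed."""
    rng = random.Random(seed)
    wasted = sum(rng.random() >= feasible_rate for _ in range(budget))
    return wasted / budget

def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

runs = [toy_wasted_fraction(seed) for seed in range(5)]  # five independent seeds
mean, std = summarize(runs)
print(f"wasted fraction: {mean:.3f} +/- {std:.3f} over {len(runs)} seeds")
```

With only five seeds the sample standard deviation is itself noisy, so it helps to report the seed count alongside the interval.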
Circularity Check
No circularity: new algorithmic procedure with no derivations or self-referential fits
full rationale
The paper introduces Thermal Budget Annealing (TBA) as an explicit feasible-first exploration stage (with timeouts and subspace blacklisting) that maps valid regions before warm-starting TPE. No equations, fitted parameters, predictions derived from prior fits, or self-citations appear in the provided text that would reduce the central claims to tautologies or inputs by construction. The method is presented as a heuristic procedure evaluated on DeployBench and real GPU tasks; the derivation chain is self-contained and independent of the target results.