
arxiv: 2604.25073 · v1 · submitted 2026-04-27 · 💻 cs.LG

Feasible-First Exploration for Constrained ML Deployment Optimization in Crash-Prone Hierarchical Search Spaces

Pith reviewed 2026-05-08 03:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords: machine learning deployment · constrained optimization · hierarchical search spaces · feasible-first exploration · Tree-structured Parzen Estimators · crash-prone evaluations · model deployment optimization · black-box optimization

The pith

Thermal Budget Annealing maps feasible regions first to cut wasted trials when optimizing ML deployments in crash-prone hierarchical spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deploying a machine learning model requires simultaneous choices over model family, quantization, backend, and serving settings, which creates a hierarchical search space full of invalid configurations that crash or violate constraints. In such hostile spaces, standard black-box methods like TPE spend much of a small evaluation budget on these invalid points. The paper introduces Thermal Budget Annealing as an explicit feasible-first exploration phase that locates valid regions before handing control to warm-started TPE, supported by early timeouts and subspace blacklisting for robustness. Tests on synthetic cases and real GPU hardware with five vision models across five NVIDIA targets show higher rates of valid model-family discovery and lower wasted budget than cold-start TPE.

Core claim

The paper claims that in hierarchical mixed-variable deployment spaces where valid configurations are rare, an explicit feasible-first exploration stage called Thermal Budget Annealing outperforms cold-start TPE. The stage first maps valid and feasible regions and then warm-starts the optimizer. Two added mechanisms support it: trial timeouts that abort infeasible runs early, and subspace blacklisting that suppresses repeatedly failing categorical subspaces.

What carries the argument

Thermal Budget Annealing (TBA), a feasible-first exploration procedure that maps valid and feasible regions before warm-starting TPE, using trial timeouts and subspace blacklisting for hostile hardware.
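The two-stage shape of this idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy space, the crash set, the scoring function, and the name `feasible_first` are all hypothetical, and the second stage is a plain random sampler restricted to leaves already seen to be valid, standing in for warm-started TPE.

```python
import random

# Hypothetical toy space: a model family, then a backend nested under it.
SPACE = {
    "resnet": ["onnx", "trt"],
    "vit": ["onnx", "trt", "torch"],
    "mobilenet": ["onnx"],
}
# Hypothetical hidden crash zones; the search loop never reads this set directly.
CRASHES = {("resnet", "trt"), ("vit", "onnx"), ("vit", "torch")}


def evaluate(family, backend, batch):
    """Return a score to maximize, or None if the trial crashes."""
    if (family, backend) in CRASHES:
        return None
    base = {"resnet": 0.80, "vit": 0.90, "mobilenet": 0.70}[family]
    return base - 0.01 * abs(batch - 8)


def feasible_first(budget, explore_trials, seed=0):
    """Stage 1 maps which leaves are valid; stage 2 samples only from those."""
    rng = random.Random(seed)
    history = []  # entries are (family, backend, batch, score or None)

    # Stage 1: cycle through every (family, backend) leaf to map validity.
    leaves = [(f, b) for f, backends in SPACE.items() for b in backends]
    rng.shuffle(leaves)
    for i in range(explore_trials):
        family, backend = leaves[i % len(leaves)]
        batch = rng.choice([1, 2, 4, 8, 16, 32])
        history.append((family, backend, batch, evaluate(family, backend, batch)))

    # Stage 2: stand-in for a warm-started optimizer (assumption: not TPE).
    # It only proposes leaves that produced a valid result in stage 1.
    valid = sorted({(f, b) for f, b, _, s in history if s is not None})
    for _ in range(budget - explore_trials):
        family, backend = rng.choice(valid) if valid else rng.choice(leaves)
        batch = rng.choice([4, 8, 16])
        history.append((family, backend, batch, evaluate(family, backend, batch)))
    return history
```

In this toy setting, spending six trials on mapping pays for itself because no later trial lands in a crash zone. Whether that trade holds in a real deployment space is exactly what the paper's experiments are meant to test.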

If this is right

  • A larger share of the evaluation budget reaches valid model-family configurations under tight latency and memory limits.
  • Early timeouts and blacklisting reduce the cost of each wasted trial in spaces with hidden crash zones.
  • The approach applies directly to joint search over model family, quantization scheme, runtime backend, and serving configuration.
  • DeployBench supplies a repeatable testbed with hierarchical structure, unequal evaluation costs, and hidden invalid regions for comparing such methods.
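The effect of early timeouts on wasted budget under unequal evaluation costs can be made concrete with a small accounting sketch. The function names and the runtimes in the usage note are illustrative assumptions, not values from the paper, and the paper may define wasted budget differently (for example, by trial count rather than wall-clock cost).

```python
def charged_cost(runtime_s, timeout_s):
    """Wall-clock cost charged to the budget when a trial is cut off at timeout_s."""
    return min(runtime_s, timeout_s)


def wasted_fraction(trials, timeout_s):
    """Share of total charged cost spent on trials that produced no usable result.

    trials is a list of (runtime_s, is_valid) pairs. A valid trial that runs
    longer than the timeout is aborted too, so it also counts as wasted.
    """
    total = 0.0
    wasted = 0.0
    for runtime_s, is_valid in trials:
        cost = charged_cost(runtime_s, timeout_s)
        completed = is_valid and runtime_s <= timeout_s
        total += cost
        if not completed:
            wasted += cost
    return wasted / total if total else 0.0
```

With the made-up trials `[(10, True), (300, False), (20, True), (300, False)]`, no timeout gives a wasted fraction of 600/630, while a 30-second timeout gives 60/90. The invalid trials still occur, but each one costs far less.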

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feasible-first mapping step could transfer to other black-box problems with sparse feasible sets, such as circuit design or molecular property optimization.
  • Blacklisting repeated failures in categorical subspaces may generalize to adaptive pruning techniques in broader hierarchical optimization.
  • The hybrid structure suggests testing TBA as a front-end for other model-based optimizers beyond TPE in mixed-variable settings.

Load-bearing premise

That an explicit feasible-first exploration stage followed by warm-started TPE will outperform cold-start TPE without the mapping phase missing high-value regions or the blacklisting discarding useful subspaces.

What would settle it

Run TBA and cold-start TPE on a synthetic hierarchical space engineered with small, disconnected feasible pockets and measure whether TBA still reduces the fraction of budget spent on invalid trials or instead wastes early evaluations on failed mapping attempts.
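A harness for this kind of test could start from something like the sketch below. It builds a categorical grid in which only a few isolated cells are valid and measures the invalid-trial fraction for uniform random search as a reference point. The grid layout, function names, and parameters are assumptions for illustration; TBA and TPE are not implemented here and would be plugged in alongside the baseline.

```python
import random


def make_pockets(n_families, n_backends, n_pockets, seed=0):
    """Pick n_pockets isolated valid cells in an n_families x n_backends grid."""
    rng = random.Random(seed)
    cells = [(f, b) for f in range(n_families) for b in range(n_backends)]
    return set(rng.sample(cells, n_pockets))


def random_search_waste(pockets, n_families, n_backends, budget, seed=0):
    """Fraction of a uniform random-search budget that lands on invalid cells."""
    rng = random.Random(seed)
    invalid = 0
    for _ in range(budget):
        cell = (rng.randrange(n_families), rng.randrange(n_backends))
        if cell not in pockets:
            invalid += 1
    return invalid / budget
```

For uniform sampling the expected waste is one minus the share of valid cells, so a grid of 10 by 10 with 3 pockets wastes about 97% of trials on average. A method that claims to help in this regime should beat that reference without spending most of its budget on the mapping itself.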

Figures

Figures reproduced from arXiv: 2604.25073 by Christian Lysenstøen.

Figure 1: Best feasible objective versus evaluation budget on Crashy Branin (10 seeds, mean …
Figure 2: Wasted budget fraction versus evaluation budget on Crashy Branin. The hybrid …
Figure 3: Edge-tight deployment results on RTX 5080: accuracy and wasted budget fraction by …
Figure 4: Model family convergence per seed under edge-tight constraints on RTX 5080. Each …
Figure 5: Discovery rate and wasted budget fraction across five GPU targets. The hybrid …
Original abstract

Deploying machine learning models under production constraints requires joint optimization over model family, quantization scheme, runtime backend, and serving configuration. This induces a hierarchical mixed-variable search space in which many configurations are invalid: evaluations may crash, exceed memory limits, or violate latency constraints. Standard black-box optimizers such as Tree-structured Parzen Estimators (TPE) and constrained Bayesian optimization are effective when valid configurations are common, but they can spend a large fraction of a small evaluation budget on invalid or uninformative trials in hostile deployment spaces. This paper studies that regime and asks whether optimization should be decomposed into an explicit exploration stage followed by model-guided exploitation. We propose Thermal Budget Annealing (TBA), a feasible-first exploration procedure that maps valid and feasible regions before warm-starting TPE. The method includes two robustness mechanisms for hostile hardware: trial timeouts that abort clearly infeasible evaluations early, and subspace blacklisting that temporarily suppresses categorical subspaces after repeated failures. We also introduce DeployBench, a benchmark suite for deployment optimization with hierarchical structure, hidden crash zones, hard constraints, and unequal evaluation costs. On synthetic benchmarks and real GPU deployment with five pre-trained vision models across five GPU targets (NVIDIA H100, A100, RTX 5080, L4, and T4), the proposed hybrid improves model-family discovery under tight constraints while reducing wasted budget relative to cold-start TPE.
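Temporary suppression of a categorical subspace after repeated failures might look like the following minimal sketch. The class name, the consecutive-failure threshold, the cooldown measured in trials, and the optional `scale` multiplier are all assumptions made for illustration; the abstract does not specify the trigger criteria or how long a subspace stays suppressed.

```python
import math


class SubspaceBlacklist:
    """Suppress a subspace for a number of trials after repeated failures."""

    def __init__(self, max_failures=3, cooldown=5):
        self.max_failures = max_failures  # consecutive failures before suppression
        self.cooldown = cooldown          # base suppression length, in trials
        self.failures = {}                # subspace -> consecutive failure count
        self.blocked_until = {}           # subspace -> first trial index allowed again

    def is_blocked(self, subspace, trial):
        return trial < self.blocked_until.get(subspace, 0)

    def record(self, subspace, ok, trial, scale=1.0):
        """Record the outcome of a trial; scale stretches or shrinks the cooldown."""
        if ok:
            self.failures[subspace] = 0
            return
        count = self.failures.get(subspace, 0) + 1
        self.failures[subspace] = count
        if count >= self.max_failures:
            length = max(1, math.ceil(self.cooldown * scale))
            self.blocked_until[subspace] = trial + 1 + length
            self.failures[subspace] = 0  # start fresh after reactivation
```

Because the block expires, a subspace holding a rare valid configuration can be sampled again later. The cost is that a persistently crashing subspace is retried once per cooldown cycle.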

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Thermal Budget Annealing (TBA), a feasible-first exploration procedure that first maps valid and feasible regions in crash-prone hierarchical mixed-variable spaces for ML deployment optimization before warm-starting Tree-structured Parzen Estimators (TPE). It incorporates trial timeouts and subspace blacklisting as robustness mechanisms for hostile hardware, introduces the DeployBench benchmark suite with hierarchical structure and hidden crash zones, and reports empirical gains in model-family discovery under tight constraints plus reduced wasted budget versus cold-start TPE on synthetic benchmarks and real GPU tests involving five pre-trained vision models across NVIDIA H100, A100, RTX 5080, L4, and T4 targets.

Significance. If the central empirical claims hold after addressing the noted gaps, the decomposition into explicit feasible-first mapping followed by warm-started exploitation could offer a useful practical strategy for budget-constrained optimization in deployment settings where invalid configurations dominate. The DeployBench suite represents a concrete contribution that may enable standardized evaluation of methods handling unequal evaluation costs and constraint violations.

major comments (2)
  1. [Subspace blacklisting mechanism (Methods)] The procedure suppresses categorical subspaces after repeated failures but provides no bound on the false-positive blacklisting rate, no reactivation schedule, and no recovery analysis for subspaces that may contain rare valid configurations (e.g., specific quantization+backend pairs). This is load-bearing for the central claim, as irreversible blacklisting within the evaluation budget would prevent the subsequent warm-started TPE from recovering missed optima and could violate the assumption that the hybrid safely identifies valid regions without discarding useful subspaces.
  2. [Experimental results (Experiments section)] The reported improvements on synthetic benchmarks and real GPU deployments lack error bars, ablations isolating the feasible-first mapping stage from the warm-start TPE component, and quantitative details on how the mapping phase allocates budget or avoids missing high-value regions. Without these, the claim of reduced wasted budget relative to cold-start TPE cannot be fully verified and the weakest assumption (that the hybrid does not miss high-value regions) remains untested.
minor comments (2)
  1. [Abstract] The abstract states empirical improvements but does not reference specific tables, figures, or quantitative metrics (e.g., wasted budget fractions or discovery rates); cross-reference these explicitly for clarity.
  2. [Methods] Notation and reproducibility: Ensure the full methods section includes pseudocode for TBA, explicit definitions of the hierarchical space variables, and the exact criteria for trial timeouts and blacklisting triggers to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the subspace blacklisting mechanism and the experimental validation that we will strengthen in revision. We respond point by point below.

Point-by-point responses
  1. Referee: [Subspace blacklisting mechanism (Methods)] The procedure suppresses categorical subspaces after repeated failures but provides no bound on the false-positive blacklisting rate, no reactivation schedule, and no recovery analysis for subspaces that may contain rare valid configurations (e.g., specific quantization+backend pairs). This is load-bearing for the central claim, as irreversible blacklisting within the evaluation budget would prevent the subsequent warm-started TPE from recovering missed optima and could violate the assumption that the hybrid safely identifies valid regions without discarding useful subspaces.

    Authors: We agree that the current Methods description would benefit from greater precision on the blacklisting procedure. The manuscript already characterizes blacklisting as temporary suppression after repeated failures within a subspace, but we will revise the text to explicitly state the reactivation schedule (reactivation occurs after a cooldown interval scaled by the current annealing temperature, allowing re-exploration of previously suppressed subspaces). We will also add an empirical recovery analysis in the Experiments section, reporting the fraction of blacklisted subspaces that are later successfully re-sampled and confirming that high-value feasible configurations are recovered before the TPE exploitation phase. While we do not claim a theoretical bound on the false-positive blacklisting rate (the mechanism is a practical heuristic for hostile hardware rather than a provably safe filter), the combination of temporary suppression, trial timeouts, and subsequent warm-started TPE ensures that the hybrid does not permanently discard useful subspaces. These clarifications and added analyses will directly address the load-bearing concern.
    Revision: yes

  2. Referee: [Experimental results (Experiments section)] The reported improvements on synthetic benchmarks and real GPU deployments lack error bars, ablations isolating the feasible-first mapping stage from the warm-start TPE component, and quantitative details on how the mapping phase allocates budget or avoids missing high-value regions. Without these, the claim of reduced wasted budget relative to cold-start TPE cannot be fully verified and the weakest assumption (that the hybrid does not miss high-value regions) remains untested.

    Authors: We concur that the experimental presentation requires these additions to fully substantiate the claims. In the revised manuscript we will include: error bars on all performance metrics computed over at least five independent random seeds; dedicated ablations that evaluate the feasible-first mapping stage in isolation, the warm-start TPE stage in isolation, and the full hybrid; and quantitative breakdowns of the mapping phase (budget fraction allocated to exploration, per-subspace success rates in identifying feasible regions, and explicit checks that high-value configurations discovered during mapping are retained for the exploitation phase). These revisions will enable direct verification of the reduced wasted-budget claim and confirm that the hybrid does not miss high-value regions.
    Revision: yes
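Error bars over independent seeds, as promised in the second response, are commonly reported as a mean with its standard error. The sketch below shows one way to compute them; the function name is hypothetical, and the simulated rebuttal does not say which statistic (standard error, standard deviation, or a confidence interval) the error bars would use.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for per-seed metric values."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

For example, five made-up per-seed wasted-budget fractions `[0.2, 0.3, 0.25, 0.35, 0.4]` give a mean of 0.3 and a standard error of about 0.035. With only five seeds the interval is wide, which is worth keeping in mind when two methods differ by a small margin.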

Circularity Check

0 steps flagged

No circularity: new algorithmic procedure with no derivations or self-referential fits

full rationale

The paper introduces Thermal Budget Annealing (TBA) as an explicit feasible-first exploration stage (with timeouts and subspace blacklisting) that maps valid regions before warm-starting TPE. No equations, fitted parameters, predictions derived from prior fits, or self-citations appear in the provided text that would reduce the central claims to tautologies or inputs by construction. The method is presented as a heuristic procedure evaluated on DeployBench and real GPU tasks; the derivation chain is self-contained and independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are detailed. The method implicitly assumes hierarchical search spaces contain mappable feasible regions and that early termination and blacklisting preserve optimization progress.


