
arxiv: 1907.00678 · v1 · pith:DRQUBKZ4new · submitted 2019-07-01 · 💻 cs.LG · cs.AI

Two-stage Optimization for Machine Learning Workflow

Pith reviewed 2026-05-25 12:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords autoML · data pipelines · hyperparameter tuning · machine learning workflows · two-stage optimization · time allocation · pipeline specificity

The pith

A two-stage optimization process builds data pipelines before configuring algorithms, and the paper finds that preprocessing has a larger impact than hyperparameter tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage optimization process for machine learning workflows that first constructs data pipelines and then tunes algorithm settings. Experiments compare the two stages to establish that data preprocessing steps often influence final model quality more than adjustments to algorithm parameters. Time-allocation policies are given to divide search effort between the stages in a way that does not depend on any particular meta-optimizer. A metric is also defined to measure whether a given pipeline is tied to one algorithm or works independently, which supports pruning and cold-start meta-learning. These elements together aim to reduce manual work in building production machine learning systems.

Core claim

The paper claims that data pipeline construction contributes more to model performance than algorithm configuration, that time can be split efficiently between the two stages using agnostic policies, and that a pipeline-algorithm specificity metric enables targeted pruning and meta-learning.

What carries the argument

Two-stage optimization process that separates data pipeline search from algorithm configuration, together with time-allocation policies and a pipeline specificity metric.
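
To make the separation of the two stages concrete, here is a minimal sketch, not the author's implementation. It uses plain random search as a stand-in meta-optimizer, and every name in it (`toy_score`, `PIPELINE_SPACE`, `two_stage_search`, the hyperparameter `c`) is hypothetical. The scoring function is a toy built so that pipeline choices dominate; it illustrates the control flow only and is no evidence for the paper's empirical claim.

```python
import random

# Hypothetical search spaces, invented for illustration only.
PIPELINE_SPACE = {"scaler": ["none", "standard"], "selector": ["none", "pca"]}
BASELINE_PIPELINE = {"scaler": "none", "selector": "none"}
DEFAULT_HYPERPARAMS = {"c": 5.0}


def toy_score(pipeline, hyperparams):
    """Stand-in for a validation score; a real run would train and evaluate a model."""
    score = 0.4
    if pipeline["scaler"] == "standard":
        score += 0.2
    if pipeline["selector"] == "pca":
        score += 0.2
    # Small bonus when the hyperparameter c (drawn from [0, 10]) is close to 1.0.
    score += 0.1 * (1.0 - abs(hyperparams["c"] - 1.0) / 10.0)
    return score


def two_stage_search(n_pipeline_trials, n_algorithm_trials, seed=0):
    """Random search over pipelines first, then over algorithm hyperparameters."""
    rng = random.Random(seed)

    # Stage 1: search pipelines while the algorithm keeps its default settings.
    best_pipeline = dict(BASELINE_PIPELINE)
    best_score = toy_score(best_pipeline, DEFAULT_HYPERPARAMS)
    for _ in range(n_pipeline_trials):
        candidate = {step: rng.choice(opts) for step, opts in PIPELINE_SPACE.items()}
        score = toy_score(candidate, DEFAULT_HYPERPARAMS)
        if score > best_score:
            best_pipeline, best_score = candidate, score

    # Stage 2: tune the algorithm on top of the pipeline found in stage 1.
    best_hyperparams = dict(DEFAULT_HYPERPARAMS)
    for _ in range(n_algorithm_trials):
        candidate = {"c": rng.uniform(0.0, 10.0)}
        score = toy_score(best_pipeline, candidate)
        if score > best_score:
            best_hyperparams, best_score = candidate, score

    return best_pipeline, best_hyperparams, best_score
```

Calling `two_stage_search(20, 20)` returns the selected pipeline, the tuned hyperparameters, and the final toy score. Any other meta-optimizer could replace the two random-search loops without changing the overall structure.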

If this is right

  • Machine learning model building can be automated by allocating more search resources to pipeline construction than to parameter tuning.
  • Time-allocation policies can be used with any meta-optimizer to balance the two stages without redesign.
  • The specificity metric supports removal of algorithm-dependent pipelines and transfer of pipelines across algorithms for faster cold starts.
  • Production deployment of machine learning becomes more scalable when pipeline search is treated as the primary stage.
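
The optimizer-agnostic idea in the second bullet can be sketched as follows. This is an assumption-laden illustration, not the paper's policies: `split_policy`, `alternating_policy`, and `run_schedule` are hypothetical names, and the budget is an abstract number (seconds, trials, or any other unit). A policy only produces a schedule of stages and budgets; the meta-optimizer is passed in as an opaque callable.

```python
def split_policy(total_budget, pipeline_fraction):
    """Fixed split: a fraction of the budget goes to stage 1, the rest to stage 2."""
    if not 0.0 <= pipeline_fraction <= 1.0:
        raise ValueError("pipeline_fraction must lie in [0, 1]")
    pipeline_budget = total_budget * pipeline_fraction
    return [("pipeline", pipeline_budget),
            ("algorithm", total_budget - pipeline_budget)]


def alternating_policy(total_budget, slice_budget):
    """Alternate between the two stages in fixed slices until the budget runs out."""
    if slice_budget <= 0:
        raise ValueError("slice_budget must be positive")
    schedule = []
    remaining = total_budget
    stage = "pipeline"
    while remaining > 0:
        step = min(slice_budget, remaining)
        schedule.append((stage, step))
        remaining -= step
        stage = "algorithm" if stage == "pipeline" else "pipeline"
    return schedule


def run_schedule(schedule, optimizers):
    """Run each scheduled slice with the optimizer registered for its stage.

    `optimizers` maps a stage name to a callable (budget, state) -> state, so the
    policy never inspects how the meta-optimizer works internally.
    """
    state = {}
    for stage, budget in schedule:
        if budget > 0:
            state = optimizers[stage](budget, state)
    return state
```

Because the schedule is plain data, the same policy can drive random search, Bayesian optimization, or any other search routine wrapped in the `(budget, state) -> state` signature assumed here.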

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of pipeline and configuration stages could be tested in domains outside standard supervised learning such as reinforcement learning or time-series forecasting.
  • The policies might be adapted to dynamic time budgets that change during a single run based on early performance signals.
  • Combining the specificity metric with existing pipeline libraries could reduce redundant searches across many algorithms.
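
As a purely illustrative reading of what a specificity score might compute, the sketch below measures how far the best pipeline configurations found for different algorithms spread around their average. This is an assumption, not the paper's definition: the function name `specificity_score`, the encoding of each pipeline as a list of parameters normalized to [0, 1], and the choice of mean absolute deviation are all hypothetical.

```python
def specificity_score(best_configs):
    """Mean absolute deviation of per-algorithm best pipelines from their centroid.

    `best_configs` maps an algorithm name to its best pipeline configuration,
    encoded as a list of floats normalized to [0, 1] (an assumed encoding).
    A score of 0.0 means every algorithm prefers the same pipeline; larger
    values mean the preferred pipelines differ more across algorithms.
    """
    vectors = list(best_configs.values())
    if not vectors:
        raise ValueError("at least one algorithm is required")
    dim = len(vectors[0])
    if dim == 0 or any(len(v) != dim for v in vectors):
        raise ValueError("all configurations must share the same non-zero length")

    # Per-dimension average of the best configurations.
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    # Average absolute distance to the centroid over all algorithms and dimensions.
    total = sum(abs(v[i] - centroid[i]) for v in vectors for i in range(dim))
    return total / (len(vectors) * dim)
```

Under this reading, a low score would suggest that a pipeline found for one algorithm is a reasonable starting point for another, which is the kind of transfer the cold-start use case relies on.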

Load-bearing premise

The observed greater impact of data pipelines over algorithm configuration, along with the effectiveness of the time policies, will hold for datasets and meta-optimizers beyond those used in the experiments.

What would settle it

Repeating the experiments on a fresh collection of datasets with a different meta-optimizer and obtaining results where algorithm configuration consistently improves performance more than pipeline search would falsify the central importance claim.
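
The comparison described above could be tallied per dataset as in the following sketch. The function name `compare_stage_gains` and the result keys `baseline`, `pipeline_only`, and `algorithm_only` are hypothetical; the scores would come from an actual rerun, and none are supplied here.

```python
def compare_stage_gains(results):
    """Count, per dataset, which stage improves more over the baseline score.

    `results` maps a dataset name to a dict with three scores (assumed keys):
    'baseline', 'pipeline_only' (pipeline search with default algorithm), and
    'algorithm_only' (algorithm tuning with no preprocessing).
    """
    summary = {"pipeline_wins": 0, "algorithm_wins": 0, "ties": 0}
    for scores in results.values():
        pipeline_gain = scores["pipeline_only"] - scores["baseline"]
        algorithm_gain = scores["algorithm_only"] - scores["baseline"]
        if pipeline_gain > algorithm_gain:
            summary["pipeline_wins"] += 1
        elif algorithm_gain > pipeline_gain:
            summary["algorithm_wins"] += 1
        else:
            summary["ties"] += 1
    return summary
```

A tally dominated by `algorithm_wins` on fresh datasets and a different meta-optimizer would be the kind of outcome that counts against the central claim.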

Figures

Figures reproduced from arXiv: 1907.00678 by Alexandre Quemy.

Figure 1: Typical machine learning workflow. On one hand, there are plenty of reasons that can explain why a data source cannot be used directly and requires preprocessing: too many variables, imbalanced dataset, missing values, outliers, noise, specific domain restriction of the algorithms, etc. On the other hand, data preprocessing has a huge impact on model performance [4, 3, 5].
Figure 2: Example of real-life pipelines designed with SAS (left) and IBM Watson Studio …
Figure 3: Two-stage optimization process.
Figure 4: Each node can be instantiated with an operator or left empty.
Figure 5: Density of pipeline configurations (left). The vertical line represents the baseline …
Figure 6: Accuracy depending on the time spent on each phase of the optimization process.
Figure 7: Evolution of the best score in time for different policies. Split 300 and Split 0 are …
Figure 8: Heatmap depicting the accuracy depending on the pipeline parameter configura…
Original abstract

Machines learning techniques plays a preponderant role in dealing with massive amount of data and are employed in almost every possible domain. Building a high quality machine learning model to be deployed in production is a challenging task, from both, the subject matter experts and the machine learning practitioners. For a broader adoption and scalability of machine learning systems, the construction and configuration of machine learning workflow need to gain in automation. In the last few years, several techniques have been developed in this direction, known as autoML. In this paper, we present a two-stage optimization process to build data pipelines and configure machine learning algorithms. First, we study the impact of data pipelines compared to algorithm configuration in order to show the importance of data preprocessing over hyperparameter tuning. The second part presents policies to efficiently allocate search time between data pipeline construction and algorithm configuration. Those policies are agnostic from the metaoptimizer. Last, we present a metric to determine if a data pipeline is specific or independent from the algorithm, enabling fine-grain pipeline pruning and meta-learning for the coldstart problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage optimization process for machine learning workflows. It first empirically studies the relative impact of data pipeline construction versus algorithm hyperparameter configuration to argue for the greater importance of data preprocessing. It then introduces meta-optimizer-agnostic policies for allocating search time between the two stages and defines a metric to assess whether a given data pipeline is algorithm-specific or independent, supporting pruning and meta-learning for cold-start problems.

Significance. If the empirical comparisons and policy evaluations hold under broader conditions, the work could inform more efficient AutoML designs by directing attention to data preprocessing and providing practical, optimizer-independent time-allocation heuristics. The pipeline-specificity metric is a concrete, potentially reusable contribution for meta-learning pipelines.

major comments (2)
  1. [Experiments] Experiments section: the headline claim that data pipelines have greater impact than algorithm configuration rests on results from a fixed collection of datasets and particular meta-optimizers. Without additional cross-domain validation (e.g., on high-dimensional, noisy, or heterogeneous-feature datasets) or sensitivity analysis, the general methodological recommendation does not follow.
  2. [Time Allocation Policies] Time-allocation policies section: the reported effectiveness of the proposed policies is demonstrated only within the same experimental setup; the meta-optimizer-agnostic claim requires explicit tests on at least one additional search algorithm or a different class of meta-optimizer to confirm robustness.
minor comments (2)
  1. [Abstract] Abstract contains grammatical errors ('Machines learning techniques plays' should read 'Machine learning techniques play').
  2. [Metric Definition] Notation for the specificity metric should be introduced with a clear equation or definition before its use in the pruning discussion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline claim that data pipelines have greater impact than algorithm configuration rests on results from a fixed collection of datasets and particular meta-optimizers. Without additional cross-domain validation (e.g., on high-dimensional, noisy, or heterogeneous-feature datasets) or sensitivity analysis, the general methodological recommendation does not follow.

    Authors: We agree that the headline claim would be strengthened by broader validation. The experiments use a collection of standard benchmark datasets, but we acknowledge the limitation regarding cross-domain coverage. In the revision we will add a sensitivity analysis subsection and results on additional high-dimensional and heterogeneous-feature datasets to better support the scope of the methodological recommendation.
    revision: yes

  2. Referee: [Time Allocation Policies] Time-allocation policies section: the reported effectiveness of the proposed policies is demonstrated only within the same experimental setup; the meta-optimizer-agnostic claim requires explicit tests on at least one additional search algorithm or a different class of meta-optimizer to confirm robustness.

    Authors: The policies are formulated without reference to any particular meta-optimizer internals and are therefore intended to be agnostic. Nevertheless, the empirical demonstration was limited to the optimizers used in the study. To confirm robustness we will include results with at least one additional search algorithm (from a different class) in the revised Time Allocation Policies section.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study with no derivations or self-referential fitting

Full rationale

The paper describes a two-stage optimization process and reports experimental comparisons of data-pipeline impact versus algorithm configuration, plus time-allocation policies and a specificity metric. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. All central claims rest on direct empirical measurements rather than any reduction by construction to the paper's own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5700 in / 953 out tokens · 17769 ms · 2026-05-25T12:25:05.121716+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 7 internal anchors

  1. [1] R. Elshawi, M. Maher, S. Sakr, Automated machine learning: State-of-the-art and open challenges (2019). arXiv:1906.02287
  2. [2] F. Hutter, L. Kotthoff, J. Vanschoren, Automatic machine learning: methods, systems, challenges, Challenges in Mach. Learn
  3. [3] S. F. Crone, S. Lessmann, R. Stahlbock, The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing, Eur. J. Oper. Res. 173 (3) (2006) 781–800
  4. [4] T. Dasu, T. Johnson, Exploratory data mining and data cleaning, Vol. 479, John Wiley & Sons, 2003
  5. [5] N. M. Nawi, W. H. Atomi, M. Z. Rehman, The effect of data pre-processing on optimized training of artificial neural networks, Procedia Technology 11 (2013) 32–39, int. Conf. Elect. Eng. Info
  6. [6] D. H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Comput. 8 (7) (1996) 1341–1390
  7. [7] M. Chessell, F. Scheepers, N. Nguyen, R. van Kessel, R. van der Starre, Governing and managing big data for analytics and decision makers, IBM Redguides for Business Leaders
  8. [8] A. Quemy, Data pipeline selection and optimization, in: Proc. Int. Workshop on Design, Optim., Languages and Anal. Processing of Big Data, 2019
  9. [9] D. C. Montgomery, Design and analysis of experiments, John Wiley & Sons, 2017
  10. [10] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13 (Feb) (2012) 281–305
  11. [11] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, in: Proc. Int. Conf. Learn. Intel. Optim., Springer-Verlag, Berlin, Heidelberg, 2011, pp. 507–523
  12. [12] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: Proc. Int. Conf. Neural Inf. Process. Syst., 2011, pp. 2546–2554
  13. [13] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-weka: Combined selection and hyperparameter optimization of classification algorithms, in: Int. Conf. Knowl. Disc. Data Min., ACM, 2013, pp. 847–855
  14. [14] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, K. Leyton-Brown, Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka, J. Mach. Learn. Res. 18 (1) (2017) 826–830
  15. [15] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, F. Hutter, Efficient and robust automated machine learning, in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett (Eds.), Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2962–2970
  16. [16] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, Hyperopt: a python library for model selection and hyperparameter optimization, Comput. Sci. & Discovery 8 (1) (2015) 014008
  17. [17] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization of machine learning algorithms, in: Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 2951–2959
  18. [18] J. Wilson, F. Hutter, M. Deisenroth, Maximizing acquisition functions for bayesian optimization, in: Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 9884–9895
  19. [19] P. I. Frazier, A tutorial on bayesian optimization, arXiv preprint arXiv:1807.02811
  20. [20] J. Močkus, On bayesian methods for seeking the extremum, in: Optimization Techniques IFIP Technical Conference, Springer, 1975, pp. 400–404
  21. [21] H. Rakotoarison, M. Sebag, AutoML with Monte Carlo Tree Search, in: Workshop AutoML 2018 @ ICML/IJCAI-ECAI, Pavel Brazdil, Christophe Giraud-Carrier, and Isabelle Guyon, Stockholm, Sweden, 2018
  22. [22] T. Domhan, J. T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: Int. Conf. Artif. Intel., 2015
  23. [23] K. Swersky, J. Snoek, R. P. Adams, Freeze-thaw bayesian optimization, arXiv preprint arXiv:1406.3896
  24. [24] K. Jamieson, A. Talwalkar, Non-stochastic best arm identification and hyperparameter optimization, in: Artificial Intelligence and Statistics, 2016, pp. 240–248
  25. [25] L. Li, K. Jamieson, Hyperband: A novel bandit-based approach to hyperparameter optimization, J. Mach. Learn. Res. 18 (2018) 1–52
  26. [26] J. Nalepa, M. Myller, S. Piechaczek, K. Hrynczenko, M. Kawulok, Genetic selection of training sets for (not only) artificial neural networks, in: Proc. Int. Conf. Beyond Databases, Architectures Struct., 2018, pp. 194–206
  27. [27] R. S. Olson, N. Bartley, R. J. Urbanowicz, J. H. Moore, Evaluation of a tree-based pipeline optimization tool for automating data science, in: Proc. Gen. and Evol. Comput. Conf., ACM, 2016, pp. 485–492
  28. [28] B. Chen, H. Wu, W. Mo, I. Chattopadhyay, H. Lipson, Autostacker: A compositional evolutionary learning system, in: Proc. Gen. and Evol. Comput. Conf., 2018, pp. 402–409
  29. [29] X. Sun, J. Lin, B. Bischl, Reinbo: Machine learning pipeline search and configuration with bayesian optimization embedded reinforcement learning, CoRR abs/1904.05381
  30. [30] J. Kim, S. Kim, S. Choi, Learning to warm-start bayesian hyperparameter optimization, arXiv preprint arXiv:1710.06219
  31. [31] B. Bilalli, A. Abelló, T. Aluja-Banet, On the predictive power of meta-features in openml, Int. J. Appl. Math. Comput. Sci. 27 (4) (2017) 697–712
  32. [32] J. Vanschoren, Meta-learning: A survey, arXiv preprint arXiv:1810.03548
  33. [33] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980
  34. [34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830
  35. [35] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, K. Leyton-Brown, Towards an empirical foundation for assessing bayesian optimization of hyperparameters, in: NIPS workshop on Bayesian Optimization in Theory and Practice, 2013
  36. [36] B. Bilalli, A. Abelló, T. Aluja-Banet, R. Wrembel, Intelligent assistance for data pre-processing, Computer Standards & Interfaces (2018) 101–109
  37. [37] B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn, J. Vanschoren, Openml benchmarking suites and the openml100, arXiv preprint arXiv:1708.03731
  38. [38] A. Chowdhury, M. Magdon-Ismail, B. Yener, Quantifying error contributions of computational steps, algorithms and hyperparameter choices in image classification pipelines, CoRR abs/1903.02521