pith. sign in

arxiv: 1906.12230 · v1 · pith:WPJYGY5Dnew · submitted 2019-06-28 · 💻 cs.LG · cs.CL· stat.ML

FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms

Pith reviewed 2026-05-25 13:31 UTC · model grok-4.3

classification 💻 cs.LG cs.CLstat.ML
keywords model selectionbandit algorithmsadaptive evaluationconfidence guaranteesmachine learningsentiment analysisstochastic optimization
0
0 comments X

The pith

FIESTA uses adaptive bandit algorithms to identify the best model from many candidates with fewer evaluations and statistical guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FIESTA is a model selection method that applies ideas from bandit algorithms to decide how many times to evaluate each candidate model on different data splits and random seeds. It spends more effort on models that look promising and less on those that do not, while still giving a guarantee that the chosen model is the true best one with high probability. This addresses the common but unreliable practice of comparing models on single evaluations or single splits. A reader would care because selecting among dozens of models can require thousands of training runs, and this approach cuts that cost substantially without sacrificing reliability.

Core claim

The paper introduces FIESTA as an approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models by adaptively determining the numbers of data splits and random seeds for each model using bandit theory, with a Python implementation that produces confidence guarantees of correctly selecting the optimal model. It demonstrates this by selecting among 8 sentiment analysis methods using dramatically fewer evaluations.

What carries the argument

The FIESTA algorithm, which adaptively allocates evaluations across models and data splits using bandit algorithms to focus resources on promising models while maintaining selection guarantees.

If this is right

  • Reliable identification of the top model requires multiple train-test splits in addition to multiple random seeds.
  • Computational cost of model selection drops substantially compared to fixed evaluation budgets.
  • The method provides explicit probabilistic guarantees on selecting the correct optimal model.
  • The approach is demonstrated on target-dependent sentiment analysis but applies generally to model collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could be extended to hyperparameter tuning or neural architecture search where many variants need comparison.
  • It suggests that shared task evaluations should mandate multiple splits to avoid unreliable rankings.
  • Practitioners might integrate it into standard model comparison workflows to save resources.

Load-bearing premise

The stochastic performance estimates from different random seeds and train-test splits must satisfy the concentration and reward assumptions needed by the bandit algorithms for the adaptive allocation and guarantees to hold.

What would settle it

Running the algorithm on a collection of models where the true best model is known from exhaustive evaluation, and checking whether it selects a different model more often than the claimed error probability allows.

Figures

Figures reproduced from arXiv: 1906.12230 by Andrew Moore, David S. Leslie, Henry B. Moss, Paul Rayson.

Figure 1
Figure 1. Figure 1: The left plot shows the distribution of results [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: TTTS seeking the optimal model with confi [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: F1 scores for our candidate TDSA models. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Proportion of the runs correctly selecting the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

We present FIESTA, a model selection approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models. Despite being known to produce unreliable comparisons, it is still common practice to compare model evaluations based on single choices of random seeds. We show that reliable model selection also requires evaluations based on multiple train-test splits (contrary to common practice in many shared tasks). Using bandit theory from the statistics literature, we are able to adaptively determine appropriate numbers of data splits and random seeds used to evaluate each model, focusing computational resources on the evaluation of promising models whilst avoiding wasting evaluations on models with lower performance. Furthermore, our user-friendly Python implementation produces confidence guarantees of correctly selecting the optimal model. We evaluate our algorithms by selecting between 8 target-dependent sentiment analysis methods using dramatically fewer model evaluations than current model selection approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FIESTA, a model selection method that applies adaptive bandit algorithms (drawing on UCB and successive-elimination variants) to identify the best model from a collection of candidates. It argues that reliable selection requires multiple random seeds and train-test splits (rather than single evaluations), adaptively allocates evaluations to focus resources on promising models, and supplies PAC-style confidence guarantees on correct selection of the optimum. The approach is evaluated by selecting among 8 target-dependent sentiment analysis models, claiming substantially fewer total evaluations than exhaustive or fixed-budget baselines.

Significance. If the performance estimates satisfy the requisite concentration assumptions, the work would supply a practical, statistically justified alternative to the common single-seed practice, with an open Python implementation that directly outputs selection guarantees. This addresses a genuine pain point in large-scale model comparison.

major comments (2)
  1. [theoretical analysis and experimental sections] The central claim of valid confidence guarantees (abstract and § on theoretical results) rests on the stochastic performance estimates (accuracy/F1 from random seeds and splits) satisfying the i.i.d. sub-Gaussian or bounded-reward assumptions of the chosen bandit algorithms. No diagnostic is supplied showing that the empirical reward distributions meet these conditions; correlations across splits or model-specific variance can invalidate the regret and PAC bounds used to justify early stopping and final selection.
  2. [method description] § describing the adaptive allocation procedure: the mapping from observed accuracies to bandit rewards is not shown to preserve the independence and boundedness properties required by the cited algorithms; without this, the claimed reduction in evaluations while retaining guarantees is not supported.
minor comments (2)
  1. [abstract] The abstract states that the method 'significantly reduces the computational resources' but supplies no numerical comparison (e.g., total evaluations or wall-clock time) against the baselines used in the experiments.
  2. [preliminaries] Notation for the number of splits versus seeds is introduced without an explicit table or equation linking them to the bandit horizon parameter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the importance of verifying the assumptions underlying our theoretical guarantees. We address each major comment below and will incorporate clarifications and diagnostics in the revised manuscript.

read point-by-point responses
  1. Referee: [theoretical analysis and experimental sections] The central claim of valid confidence guarantees (abstract and § on theoretical results) rests on the stochastic performance estimates (accuracy/F1 from random seeds and splits) satisfying the i.i.d. sub-Gaussian or bounded-reward assumptions of the chosen bandit algorithms. No diagnostic is supplied showing that the empirical reward distributions meet these conditions; correlations across splits or model-specific variance can invalidate the regret and PAC bounds used to justify early stopping and final selection.

    Authors: We agree that the PAC guarantees require the stated assumptions to hold for the observed accuracies. Each accuracy is bounded in [0,1] and thus 1/4-sub-Gaussian; independent random seeds and splits yield i.i.d. samples under standard assumptions. However, the manuscript does not include empirical diagnostics. In revision we will add a short subsection with variance histograms and normality checks on the per-model reward distributions from the sentiment experiments to support applicability of the bounds. revision: yes

  2. Referee: [method description] § describing the adaptive allocation procedure: the mapping from observed accuracies to bandit rewards is not shown to preserve the independence and boundedness properties required by the cited algorithms; without this, the claimed reduction in evaluations while retaining guarantees is not supported.

    Authors: The mapping treats each (seed, split) evaluation as a direct [0,1]-bounded reward. Because seeds and splits are drawn independently, the sequence of rewards for each arm remains i.i.d. and bounded. We will revise the method section to state this mapping explicitly, note the boundedness, and reference the standard sub-Gaussian property of [0,1] random variables, thereby supporting the application of UCB and successive-elimination results. revision: yes

Circularity Check

0 steps flagged

No circularity; applies external bandit theory to model selection

full rationale

The derivation applies standard bandit algorithms (UCB, successive elimination) from the statistics literature to allocate evaluations across models, seeds, and splits, with PAC-style guarantees following directly from the cited external concentration results under the paper's stated i.i.d. sub-Gaussian assumptions. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citations appear, and the central claim of reduced evaluations with valid guarantees rests on independent statistical theory rather than internal redefinition or renaming.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method depends on standard statistical assumptions from bandit literature being applicable to model performance scores; no new entities are introduced.

free parameters (1)
  • bandit hyperparameters (e.g., exploration parameter)
    Typical bandit algorithms require at least one tunable parameter that controls the exploration-exploitation trade-off; its value is not specified in the abstract.
axioms (1)
  • domain assumption Performance scores from different seeds and splits behave as stochastic rewards obeying the concentration inequalities required by the bandit algorithms used.
    This assumption is necessary for the adaptive allocation and the claimed confidence guarantees to hold.

pith-pipeline@v0.9.0 · 5692 in / 1210 out tokens · 36485 ms · 2026-05-25T13:31:31.540450+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

  1. [1]

    Andrea Arcuri and Lionel Briand. 2014. https://doi.org/10.1002/stvr.1486 A hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering . Software Testing, Verification and Reliability, 24(3):219--250

  2. [2]

    Jean-Yves Audibert and S \'e bastien Bubeck. 2010. https://hal-enpc.archives-ouvertes.fr/hal-00654404/file/COLT10.pdf Best arm identification in multi-armed bandits . In COLT - 23th Conference on Learning Theory - 2010, pages 13--p

  3. [3]

    James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media

  4. [4]

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. http://aclweb.org/anthology/P18-1128 The hitchhiker's guide to testing statistical significance in natural language processing . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383--1392. Association for Computational ...

  5. [5]

    Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press

  6. [6]

    Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 2002. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pages 255--270. Springer

  7. [7]

    Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Springer series in statistics New York, NY, USA:

  8. [8]

    Yarin Gal and Zoubin Ghahramani. 2016. http://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf A theoretically grounded application of dropout in recurrent neural networks . In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing System...

  9. [9]

    Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer

    Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. http://aclweb.org/anthology/W18-2501 Allennlp: A deep semantic natural language processing platform . In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1--6. Association for Computational...

  10. [10]

    Aur \'e lien Garivier and Emilie Kaufmann. 2016. http://proceedings.mlr.press/v49/garivier16a.pdf Optimal best arm identification with fixed confidence . In Conference on Learning Theory, pages 998--1027

  11. [11]

    Gholamreza Haffari, Tuan Dung Tran, and Mark Carman. 2017. http://aclweb.org/anthology/E17-1039 Efficient benchmarking of nlp apis using multi-armed bandits . In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 408--416. Association for Computational Linguistics

  12. [12]

    Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. https://arxiv.org/pdf/1207.0580.pdf Improving neural networks by preventing co-adaptation of feature detectors . arXiv preprint arXiv:1207.0580

  13. [13]

    Junya Honda and Akimichi Takemura. 2014. http://proceedings.mlr.press/v33/honda14.pdf Optimality of thompson sampling for gaussian bandits depends on priors . In Artificial Intelligence and Statistics, pages 375--383

  14. [14]

    Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. 2013. https://arxiv.org/pdf/1306.3917.pdf On finding the largest mean among many . arXiv preprint arXiv:1306.3917

  15. [15]

    Kevin Jamieson, Matthew Malloy, Robert Nowak, and S \'e bastien Bubeck. 2014. http://proceedings.mlr.press/v35/jamieson14.pdf lil’ucb: An optimal exploration algorithm for multi-armed bandits . In Conference on Learning Theory, pages 423--439

  16. [16]

    Kevin Jamieson and Robert Nowak. 2014. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6814096 Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting . In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1--6. IEEE

  17. [17]

    Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnab \'a s P \'o czos. 2018. http://proceedings.mlr.press/v84/kandasamy18a/kandasamy18a.pdf Parallelised bayesian optimisation via thompson sampling . In International Conference on Artificial Intelligence and Statistics

  18. [18]

    Zohar Karnin, Tomer Koren, and Oren Somekh. 2013. http://proceedings.mlr.press/v28/karnin13.pdf Almost optimal exploration in multi-armed bandits . In International Conference on Machine Learning, pages 1238--1246

  19. [19]

    Diederik P Kingma and Jimmy Ba. 2014. https://arxiv.org/pdf/1412.6980.pdf Adam: A method for stochastic optimization . arXiv preprint arXiv:1412.6980

  20. [20]

    Ron Kohavi. 1995. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf A study of cross-validation and bootstrap for accuracy estimation and model selection . In Proceedings of the 14th international joint conference on Artificial intelligence-Volume 2, pages 1137--1143. Morgan Kaufmann Publishers Inc

  21. [21]

    Tze Leung Lai and Herbert Robbins. 1985. https://core.ac.uk/download/pdf/82425825.pdf Asymptotically efficient adaptive allocation rules . Advances in applied mathematics, 6(1):4--22

  22. [22]

    Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. https://doi.org/10.18653/v1/N16-1030 Neural architectures for named entity recognition . In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260--270. Associa...

  23. [23]

    Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. https://doi.org/10.18653/v1/D17-1272 Counterfactual learning from bandit feedback under deterministic logging : A case study in statistical machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566--2576. Association for Computation...

  24. [24]

    Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. https://www.wwwconference.org/proceedings/www2010/www/p661.pdf A contextual-bandit approach to personalized news article recommendation . In Proceedings of the 19th international conference on World wide web, pages 661--670. ACM

  25. [25]

    Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. https://www.ijcai.org/proceedings/2017/0568.pdf Interactive attention networks for aspect-level sentiment classification . In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4068--4074. AAAI Press

  26. [26]

    Xuezhe Ma and Eduard Hovy. 2016. https://doi.org/10.18653/v1/P16-1101 End-to-end sequence labeling via bi-directional lstm-cnns-crf . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064--1074. Association for Computational Linguistics

  27. [27]

    Shie Mannor and John N Tsitsiklis. 2004. http://www.jmlr.org/papers/volume5/mannor04b/mannor04b.pdf The sample complexity of exploration in the multi-armed bandit problem . Journal of Machine Learning Research, 5(Jun):623--648

  28. [28]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Distributed representations of words and phrases and their compositionality . In Advances in neural information processing systems, pages 3111--3119

  29. [29]

    Henry Moss, David Leslie, and Paul Rayson. 2018. http://aclweb.org/anthology/C18-1252 Using j-k-fold cross validation to reduce variance when tuning nlp models . In Proceedings of the 27th International Conference on Computational Linguistics, pages 2978--2989. Association for Computational Linguistics

  30. [30]

    Khanh Nguyen, Hal Daum \'e III, and Jordan Boyd-Graber. 2017. https://doi.org/10.18653/v1/D17-1153 Reinforcement learning for bandit neural machine translation with simulated human feedback . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1464--1474. Association for Computational Linguistics

  31. [31]

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. https://openreview.net/pdf?id=BJJsrmfCZ Automatic differentiation in pytorch . In NIPS-W

  32. [32]

    Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. https://doi.org/10.3115/v1/D14-1162 Glove: Global vectors for word representation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543. Association for Computational Linguistics

  33. [33]

    Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. https://doi.org/10.3115/v1/S14-2004 Semeval-2014 task 4: Aspect based sentiment analysis . In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27--35. Association for Computational Linguistics

  34. [34]

    Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338--348. Association for Computational Linguistics

  35. [35]

    Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/pdf/1803.09578.pdf Why comparing single performance scores does not allow to draw conclusions about machine learning approaches . arXiv preprint arXiv:1803.09578

  36. [36]

    Daniel Russo. 2016. http://proceedings.mlr.press/v49/russo16.pdf Simple bayesian algorithms for best arm identification . In Conference on Learning Theory, pages 1417--1418

  37. [37]

    Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. https://djrusso.github.io/RLCourse/papers/TS_Tutorial.pdf A tutorial on thompson sampling . Foundations and Trends in Machine Learning, 11(1):1--96

  38. [38]

    SemEval. 2018. https://doi.org/10.18653/v1/S18-1 Proceedings of The 12th International Workshop on Semantic Evaluation . Association for Computational Linguistics, New Orleans, Louisiana

  39. [39]

    Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016. https://doi.org/10.18653/v1/P16-1152 Learning structured predictors from bandit feedback for interactive nlp . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1610--1620. Association for Computational Linguistics

  40. [40]

    Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016. http://aclweb.org/anthology/C16-1311 Effective lstms for target-dependent sentiment classification . In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3298--3307. The COLING 2016 Organizing Committee

  41. [41]

    Sof \' a S Villar, Jack Bowden, and James Wason. 2015. https://doi.org/10.1214/14-STS504 Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges . Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199

  42. [42]

    Yequan Wang, Minlie Huang, xiaoyan zhu, and Li Zhao. 2016. https://doi.org/10.18653/v1/D16-1058 Attention-based lstm for aspect-level sentiment classification . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606--615. Association for Computational Linguistics

  43. [43]

    Jie Yang, Shuailong Liang, and Yue Zhang. 2018. http://aclweb.org/anthology/C18-1327 Design challenges and misconceptions in neural sequence labeling . In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879--3889. Association for Computational Linguistics

  44. [44]

    URL: " 'urlintro :=

    ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

  45. [45]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...