FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms
Pith reviewed 2026-05-25 13:31 UTC · model grok-4.3
The pith
FIESTA uses adaptive bandit algorithms to identify the best model from many candidates with fewer evaluations and statistical guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces FIESTA as an approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models by adaptively determining the numbers of data splits and random seeds for each model using bandit theory, with a Python implementation that produces confidence guarantees of correctly selecting the optimal model. It demonstrates this by selecting among 8 sentiment analysis methods using dramatically fewer evaluations.
What carries the argument
The FIESTA algorithm, which adaptively allocates evaluations across models and data splits using bandit algorithms to focus resources on promising models while maintaining selection guarantees.
If this is right
- Reliable identification of the top model requires multiple train-test splits in addition to multiple random seeds.
- Computational cost of model selection drops substantially compared to fixed evaluation budgets.
- The method provides explicit probabilistic guarantees on selecting the correct optimal model.
- The approach is demonstrated on target-dependent sentiment analysis but applies generally to model collections.
Where Pith is reading between the lines
- The technique could be extended to hyperparameter tuning or neural architecture search where many variants need comparison.
- It suggests that shared task evaluations should mandate multiple splits to avoid unreliable rankings.
- Practitioners might integrate it into standard model comparison workflows to save resources.
Load-bearing premise
The stochastic performance estimates from different random seeds and train-test splits must satisfy the concentration and reward assumptions needed by the bandit algorithms for the adaptive allocation and guarantees to hold.
What would settle it
Running the algorithm on a collection of models where the true best model is known from exhaustive evaluation, and checking whether it selects a different model more often than the claimed error probability allows.
Figures
read the original abstract
We present FIESTA, a model selection approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models. Despite being known to produce unreliable comparisons, it is still common practice to compare model evaluations based on single choices of random seeds. We show that reliable model selection also requires evaluations based on multiple train-test splits (contrary to common practice in many shared tasks). Using bandit theory from the statistics literature, we are able to adaptively determine appropriate numbers of data splits and random seeds used to evaluate each model, focusing computational resources on the evaluation of promising models whilst avoiding wasting evaluations on models with lower performance. Furthermore, our user-friendly Python implementation produces confidence guarantees of correctly selecting the optimal model. We evaluate our algorithms by selecting between 8 target-dependent sentiment analysis methods using dramatically fewer model evaluations than current model selection approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FIESTA, a model selection method that applies adaptive bandit algorithms (drawing on UCB and successive-elimination variants) to identify the best model from a collection of candidates. It argues that reliable selection requires multiple random seeds and train-test splits (rather than single evaluations), adaptively allocates evaluations to focus resources on promising models, and supplies PAC-style confidence guarantees on correct selection of the optimum. The approach is evaluated by selecting among 8 target-dependent sentiment analysis models, claiming substantially fewer total evaluations than exhaustive or fixed-budget baselines.
Significance. If the performance estimates satisfy the requisite concentration assumptions, the work would supply a practical, statistically justified alternative to the common single-seed practice, with an open Python implementation that directly outputs selection guarantees. This addresses a genuine pain point in large-scale model comparison.
major comments (2)
- [theoretical analysis and experimental sections] The central claim of valid confidence guarantees (abstract and § on theoretical results) rests on the stochastic performance estimates (accuracy/F1 from random seeds and splits) satisfying the i.i.d. sub-Gaussian or bounded-reward assumptions of the chosen bandit algorithms. No diagnostic is supplied showing that the empirical reward distributions meet these conditions; correlations across splits or model-specific variance can invalidate the regret and PAC bounds used to justify early stopping and final selection.
- [method description] § describing the adaptive allocation procedure: the mapping from observed accuracies to bandit rewards is not shown to preserve the independence and boundedness properties required by the cited algorithms; without this, the claimed reduction in evaluations while retaining guarantees is not supported.
minor comments (2)
- [abstract] The abstract states that the method 'significantly reduces the computational resources' but supplies no numerical comparison (e.g., total evaluations or wall-clock time) against the baselines used in the experiments.
- [preliminaries] Notation for the number of splits versus seeds is introduced without an explicit table or equation linking them to the bandit horizon parameter.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the importance of verifying the assumptions underlying our theoretical guarantees. We address each major comment below and will incorporate clarifications and diagnostics in the revised manuscript.
read point-by-point responses
-
Referee: [theoretical analysis and experimental sections] The central claim of valid confidence guarantees (abstract and § on theoretical results) rests on the stochastic performance estimates (accuracy/F1 from random seeds and splits) satisfying the i.i.d. sub-Gaussian or bounded-reward assumptions of the chosen bandit algorithms. No diagnostic is supplied showing that the empirical reward distributions meet these conditions; correlations across splits or model-specific variance can invalidate the regret and PAC bounds used to justify early stopping and final selection.
Authors: We agree that the PAC guarantees require the stated assumptions to hold for the observed accuracies. Each accuracy is bounded in [0,1] and thus 1/4-sub-Gaussian; independent random seeds and splits yield i.i.d. samples under standard assumptions. However, the manuscript does not include empirical diagnostics. In revision we will add a short subsection with variance histograms and normality checks on the per-model reward distributions from the sentiment experiments to support applicability of the bounds. revision: yes
-
Referee: [method description] § describing the adaptive allocation procedure: the mapping from observed accuracies to bandit rewards is not shown to preserve the independence and boundedness properties required by the cited algorithms; without this, the claimed reduction in evaluations while retaining guarantees is not supported.
Authors: The mapping treats each (seed, split) evaluation as a direct [0,1]-bounded reward. Because seeds and splits are drawn independently, the sequence of rewards for each arm remains i.i.d. and bounded. We will revise the method section to state this mapping explicitly, note the boundedness, and reference the standard sub-Gaussian property of [0,1] random variables, thereby supporting the application of UCB and successive-elimination results. revision: yes
Circularity Check
No circularity; applies external bandit theory to model selection
full rationale
The derivation applies standard bandit algorithms (UCB, successive elimination) from the statistics literature to allocate evaluations across models, seeds, and splits, with PAC-style guarantees following directly from the cited external concentration results under the paper's stated i.i.d. sub-Gaussian assumptions. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citations appear, and the central claim of reduced evaluations with valid guarantees rests on independent statistical theory rather than internal redefinition or renaming.
Axiom & Free-Parameter Ledger
free parameters (1)
- bandit hyperparameters (e.g., exploration parameter)
axioms (1)
- domain assumption Performance scores from different seeds and splits behave as stochastic rewards obeying the concentration inequalities required by the bandit algorithms used.
Reference graph
Works this paper leans on
-
[1]
Andrea Arcuri and Lionel Briand. 2014. https://doi.org/10.1002/stvr.1486 A hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering . Software Testing, Verification and Reliability, 24(3):219--250
-
[2]
Jean-Yves Audibert and S \'e bastien Bubeck. 2010. https://hal-enpc.archives-ouvertes.fr/hal-00654404/file/COLT10.pdf Best arm identification in multi-armed bandits . In COLT - 23th Conference on Learning Theory - 2010, pages 13--p
work page 2010
-
[3]
James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media
work page 2013
-
[4]
Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. http://aclweb.org/anthology/P18-1128 The hitchhiker's guide to testing statistical significance in natural language processing . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383--1392. Association for Computational ...
work page 2018
-
[5]
Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press
work page 1994
-
[6]
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 2002. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pages 255--270. Springer
work page 2002
-
[7]
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Springer series in statistics New York, NY, USA:
work page 2001
-
[8]
Yarin Gal and Zoubin Ghahramani. 2016. http://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf A theoretically grounded application of dropout in recurrent neural networks . In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing System...
work page 2016
-
[9]
Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. http://aclweb.org/anthology/W18-2501 Allennlp: A deep semantic natural language processing platform . In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1--6. Association for Computational...
work page 2018
-
[10]
Aur \'e lien Garivier and Emilie Kaufmann. 2016. http://proceedings.mlr.press/v49/garivier16a.pdf Optimal best arm identification with fixed confidence . In Conference on Learning Theory, pages 998--1027
work page 2016
-
[11]
Gholamreza Haffari, Tuan Dung Tran, and Mark Carman. 2017. http://aclweb.org/anthology/E17-1039 Efficient benchmarking of nlp apis using multi-armed bandits . In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 408--416. Association for Computational Linguistics
work page 2017
-
[12]
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. https://arxiv.org/pdf/1207.0580.pdf Improving neural networks by preventing co-adaptation of feature detectors . arXiv preprint arXiv:1207.0580
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[13]
Junya Honda and Akimichi Takemura. 2014. http://proceedings.mlr.press/v33/honda14.pdf Optimality of thompson sampling for gaussian bandits depends on priors . In Artificial Intelligence and Statistics, pages 375--383
work page 2014
-
[14]
Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. 2013. https://arxiv.org/pdf/1306.3917.pdf On finding the largest mean among many . arXiv preprint arXiv:1306.3917
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[15]
Kevin Jamieson, Matthew Malloy, Robert Nowak, and S \'e bastien Bubeck. 2014. http://proceedings.mlr.press/v35/jamieson14.pdf lil’ucb: An optimal exploration algorithm for multi-armed bandits . In Conference on Learning Theory, pages 423--439
work page 2014
-
[16]
Kevin Jamieson and Robert Nowak. 2014. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6814096 Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting . In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1--6. IEEE
work page 2014
-
[17]
Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnab \'a s P \'o czos. 2018. http://proceedings.mlr.press/v84/kandasamy18a/kandasamy18a.pdf Parallelised bayesian optimisation via thompson sampling . In International Conference on Artificial Intelligence and Statistics
work page 2018
-
[18]
Zohar Karnin, Tomer Koren, and Oren Somekh. 2013. http://proceedings.mlr.press/v28/karnin13.pdf Almost optimal exploration in multi-armed bandits . In International Conference on Machine Learning, pages 1238--1246
work page 2013
-
[19]
Diederik P Kingma and Jimmy Ba. 2014. https://arxiv.org/pdf/1412.6980.pdf Adam: A method for stochastic optimization . arXiv preprint arXiv:1412.6980
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[20]
Ron Kohavi. 1995. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf A study of cross-validation and bootstrap for accuracy estimation and model selection . In Proceedings of the 14th international joint conference on Artificial intelligence-Volume 2, pages 1137--1143. Morgan Kaufmann Publishers Inc
work page 1995
- [21]
-
[22]
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. https://doi.org/10.18653/v1/N16-1030 Neural architectures for named entity recognition . In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260--270. Associa...
-
[23]
Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. https://doi.org/10.18653/v1/D17-1272 Counterfactual learning from bandit feedback under deterministic logging : A case study in statistical machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566--2576. Association for Computation...
-
[24]
Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. https://www.wwwconference.org/proceedings/www2010/www/p661.pdf A contextual-bandit approach to personalized news article recommendation . In Proceedings of the 19th international conference on World wide web, pages 661--670. ACM
work page 2010
-
[25]
Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. https://www.ijcai.org/proceedings/2017/0568.pdf Interactive attention networks for aspect-level sentiment classification . In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4068--4074. AAAI Press
work page 2017
-
[26]
Xuezhe Ma and Eduard Hovy. 2016. https://doi.org/10.18653/v1/P16-1101 End-to-end sequence labeling via bi-directional lstm-cnns-crf . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064--1074. Association for Computational Linguistics
-
[27]
Shie Mannor and John N Tsitsiklis. 2004. http://www.jmlr.org/papers/volume5/mannor04b/mannor04b.pdf The sample complexity of exploration in the multi-armed bandit problem . Journal of Machine Learning Research, 5(Jun):623--648
work page 2004
-
[28]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Distributed representations of words and phrases and their compositionality . In Advances in neural information processing systems, pages 3111--3119
work page 2013
-
[29]
Henry Moss, David Leslie, and Paul Rayson. 2018. http://aclweb.org/anthology/C18-1252 Using j-k-fold cross validation to reduce variance when tuning nlp models . In Proceedings of the 27th International Conference on Computational Linguistics, pages 2978--2989. Association for Computational Linguistics
work page 2018
-
[30]
Khanh Nguyen, Hal Daum \'e III, and Jordan Boyd-Graber. 2017. https://doi.org/10.18653/v1/D17-1153 Reinforcement learning for bandit neural machine translation with simulated human feedback . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1464--1474. Association for Computational Linguistics
-
[31]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. https://openreview.net/pdf?id=BJJsrmfCZ Automatic differentiation in pytorch . In NIPS-W
work page 2017
-
[32]
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. https://doi.org/10.3115/v1/D14-1162 Glove: Global vectors for word representation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543. Association for Computational Linguistics
-
[33]
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. https://doi.org/10.3115/v1/S14-2004 Semeval-2014 task 4: Aspect based sentiment analysis . In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27--35. Association for Computational Linguistics
-
[34]
Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338--348. Association for Computational Linguistics
-
[35]
Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/pdf/1803.09578.pdf Why comparing single performance scores does not allow to draw conclusions about machine learning approaches . arXiv preprint arXiv:1803.09578
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[36]
Daniel Russo. 2016. http://proceedings.mlr.press/v49/russo16.pdf Simple bayesian algorithms for best arm identification . In Conference on Learning Theory, pages 1417--1418
work page 2016
-
[37]
Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. https://djrusso.github.io/RLCourse/papers/TS_Tutorial.pdf A tutorial on thompson sampling . Foundations and Trends in Machine Learning, 11(1):1--96
work page 2018
-
[38]
SemEval. 2018. https://doi.org/10.18653/v1/S18-1 Proceedings of The 12th International Workshop on Semantic Evaluation . Association for Computational Linguistics, New Orleans, Louisiana
-
[39]
Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016. https://doi.org/10.18653/v1/P16-1152 Learning structured predictors from bandit feedback for interactive nlp . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1610--1620. Association for Computational Linguistics
-
[40]
Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016. http://aclweb.org/anthology/C16-1311 Effective lstms for target-dependent sentiment classification . In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3298--3307. The COLING 2016 Organizing Committee
work page 2016
-
[41]
Sof \' a S Villar, Jack Bowden, and James Wason. 2015. https://doi.org/10.1214/14-STS504 Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges . Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199
-
[42]
Yequan Wang, Minlie Huang, xiaoyan zhu, and Li Zhao. 2016. https://doi.org/10.18653/v1/D16-1058 Attention-based lstm for aspect-level sentiment classification . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606--615. Association for Computational Linguistics
-
[43]
Jie Yang, Shuailong Liang, and Yue Zhang. 2018. http://aclweb.org/anthology/C18-1327 Design challenges and misconceptions in neural sequence labeling . In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879--3889. Association for Computational Linguistics
work page 2018
-
[44]
ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...
-
[45]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.