FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms

Andrew Moore; David S. Leslie; Henry B. Moss; Paul Rayson

arxiv: 1906.12230 · v1 · pith:WPJYGY5Dnew · submitted 2019-06-28 · 💻 cs.LG · cs.CL· stat.ML

FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms

Henry B. Moss , Andrew Moore , David S. Leslie , Paul Rayson This is my paper

Pith reviewed 2026-05-25 13:31 UTC · model grok-4.3

classification 💻 cs.LG cs.CLstat.ML

keywords model selectionbandit algorithmsadaptive evaluationconfidence guaranteesmachine learningsentiment analysisstochastic optimization

0 comments

The pith

FIESTA uses adaptive bandit algorithms to identify the best model from many candidates with fewer evaluations and statistical guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FIESTA is a model selection method that applies ideas from bandit algorithms to decide how many times to evaluate each candidate model on different data splits and random seeds. It spends more effort on models that look promising and less on those that do not, while still giving a guarantee that the chosen model is the true best one with high probability. This addresses the common but unreliable practice of comparing models on single evaluations or single splits. A reader would care because selecting among dozens of models can require thousands of training runs, and this approach cuts that cost substantially without sacrificing reliability.

Core claim

The paper introduces FIESTA as an approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models by adaptively determining the numbers of data splits and random seeds for each model using bandit theory, with a Python implementation that produces confidence guarantees of correctly selecting the optimal model. It demonstrates this by selecting among 8 sentiment analysis methods using dramatically fewer evaluations.

What carries the argument

The FIESTA algorithm, which adaptively allocates evaluations across models and data splits using bandit algorithms to focus resources on promising models while maintaining selection guarantees.

If this is right

Reliable identification of the top model requires multiple train-test splits in addition to multiple random seeds.
Computational cost of model selection drops substantially compared to fixed evaluation budgets.
The method provides explicit probabilistic guarantees on selecting the correct optimal model.
The approach is demonstrated on target-dependent sentiment analysis but applies generally to model collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be extended to hyperparameter tuning or neural architecture search where many variants need comparison.
It suggests that shared task evaluations should mandate multiple splits to avoid unreliable rankings.
Practitioners might integrate it into standard model comparison workflows to save resources.

Load-bearing premise

The stochastic performance estimates from different random seeds and train-test splits must satisfy the concentration and reward assumptions needed by the bandit algorithms for the adaptive allocation and guarantees to hold.

What would settle it

Running the algorithm on a collection of models where the true best model is known from exhaustive evaluation, and checking whether it selects a different model more often than the claimed error probability allows.

Figures

Figures reproduced from arXiv: 1906.12230 by Andrew Moore, David S. Leslie, Henry B. Moss, Paul Rayson.

**Figure 2.** Figure 2: TTTS seeking the optimal model with confi [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: F1 scores for our candidate TDSA models. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Proportion of the runs correctly selecting the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

We present FIESTA, a model selection approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models. Despite being known to produce unreliable comparisons, it is still common practice to compare model evaluations based on single choices of random seeds. We show that reliable model selection also requires evaluations based on multiple train-test splits (contrary to common practice in many shared tasks). Using bandit theory from the statistics literature, we are able to adaptively determine appropriate numbers of data splits and random seeds used to evaluate each model, focusing computational resources on the evaluation of promising models whilst avoiding wasting evaluations on models with lower performance. Furthermore, our user-friendly Python implementation produces confidence guarantees of correctly selecting the optimal model. We evaluate our algorithms by selecting between 8 target-dependent sentiment analysis methods using dramatically fewer model evaluations than current model selection approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FIESTA applies bandits to cut evaluations in multi-seed multi-split model comparisons, but the guarantees hinge on untested assumptions about the performance estimates.

read the letter

The core contribution is taking existing bandit methods and using them to decide how many random seeds and data splits to run for each candidate model, stopping early on weaker ones. They demonstrate this on eight sentiment analysis models and report fewer total evaluations than fixed-budget baselines while claiming PAC-style guarantees on selecting the best model. The Python package is presented as ready to use, which is a practical plus. They also correctly note that single-seed results are unreliable and that multiple splits are needed for stable comparisons, which matches common experience in the field. That part is straightforward and useful. The soft spot is the one flagged in the stress test. Bandit algorithms like UCB or successive elimination require the per-evaluation rewards (accuracy or F1 from each seed/split) to be roughly i.i.d. and sub-Gaussian or bounded. Nothing in the abstract or description indicates they checked whether real ML performance numbers satisfy this; correlations across splits or model-specific variance patterns could easily break the regret bounds and the final selection guarantee. Without diagnostics or sensitivity analysis, the confidence claims rest on an assumption that may not hold in practice. This paper is aimed at practitioners who compare many models under compute constraints, especially in NLP. It is worth sending for peer review because the adaptive allocation idea is concrete and the implementation lowers the barrier, even though the statistical justification needs more scrutiny on the reward model.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FIESTA, a model selection method that applies adaptive bandit algorithms (drawing on UCB and successive-elimination variants) to identify the best model from a collection of candidates. It argues that reliable selection requires multiple random seeds and train-test splits (rather than single evaluations), adaptively allocates evaluations to focus resources on promising models, and supplies PAC-style confidence guarantees on correct selection of the optimum. The approach is evaluated by selecting among 8 target-dependent sentiment analysis models, claiming substantially fewer total evaluations than exhaustive or fixed-budget baselines.

Significance. If the performance estimates satisfy the requisite concentration assumptions, the work would supply a practical, statistically justified alternative to the common single-seed practice, with an open Python implementation that directly outputs selection guarantees. This addresses a genuine pain point in large-scale model comparison.

major comments (2)

[theoretical analysis and experimental sections] The central claim of valid confidence guarantees (abstract and § on theoretical results) rests on the stochastic performance estimates (accuracy/F1 from random seeds and splits) satisfying the i.i.d. sub-Gaussian or bounded-reward assumptions of the chosen bandit algorithms. No diagnostic is supplied showing that the empirical reward distributions meet these conditions; correlations across splits or model-specific variance can invalidate the regret and PAC bounds used to justify early stopping and final selection.
[method description] § describing the adaptive allocation procedure: the mapping from observed accuracies to bandit rewards is not shown to preserve the independence and boundedness properties required by the cited algorithms; without this, the claimed reduction in evaluations while retaining guarantees is not supported.

minor comments (2)

[abstract] The abstract states that the method 'significantly reduces the computational resources' but supplies no numerical comparison (e.g., total evaluations or wall-clock time) against the baselines used in the experiments.
[preliminaries] Notation for the number of splits versus seeds is introduced without an explicit table or equation linking them to the bandit horizon parameter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the importance of verifying the assumptions underlying our theoretical guarantees. We address each major comment below and will incorporate clarifications and diagnostics in the revised manuscript.

read point-by-point responses

Referee: [theoretical analysis and experimental sections] The central claim of valid confidence guarantees (abstract and § on theoretical results) rests on the stochastic performance estimates (accuracy/F1 from random seeds and splits) satisfying the i.i.d. sub-Gaussian or bounded-reward assumptions of the chosen bandit algorithms. No diagnostic is supplied showing that the empirical reward distributions meet these conditions; correlations across splits or model-specific variance can invalidate the regret and PAC bounds used to justify early stopping and final selection.

Authors: We agree that the PAC guarantees require the stated assumptions to hold for the observed accuracies. Each accuracy is bounded in [0,1] and thus 1/4-sub-Gaussian; independent random seeds and splits yield i.i.d. samples under standard assumptions. However, the manuscript does not include empirical diagnostics. In revision we will add a short subsection with variance histograms and normality checks on the per-model reward distributions from the sentiment experiments to support applicability of the bounds. revision: yes
Referee: [method description] § describing the adaptive allocation procedure: the mapping from observed accuracies to bandit rewards is not shown to preserve the independence and boundedness properties required by the cited algorithms; without this, the claimed reduction in evaluations while retaining guarantees is not supported.

Authors: The mapping treats each (seed, split) evaluation as a direct [0,1]-bounded reward. Because seeds and splits are drawn independently, the sequence of rewards for each arm remains i.i.d. and bounded. We will revise the method section to state this mapping explicitly, note the boundedness, and reference the standard sub-Gaussian property of [0,1] random variables, thereby supporting the application of UCB and successive-elimination results. revision: yes

Circularity Check

0 steps flagged

No circularity; applies external bandit theory to model selection

full rationale

The derivation applies standard bandit algorithms (UCB, successive elimination) from the statistics literature to allocate evaluations across models, seeds, and splits, with PAC-style guarantees following directly from the cited external concentration results under the paper's stated i.i.d. sub-Gaussian assumptions. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citations appear, and the central claim of reduced evaluations with valid guarantees rests on independent statistical theory rather than internal redefinition or renaming.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method depends on standard statistical assumptions from bandit literature being applicable to model performance scores; no new entities are introduced.

free parameters (1)

bandit hyperparameters (e.g., exploration parameter)
Typical bandit algorithms require at least one tunable parameter that controls the exploration-exploitation trade-off; its value is not specified in the abstract.

axioms (1)

domain assumption Performance scores from different seeds and splits behave as stochastic rewards obeying the concentration inequalities required by the bandit algorithms used.
This assumption is necessary for the adaptive allocation and the claimed confidence guarantees to hold.

pith-pipeline@v0.9.0 · 5692 in / 1210 out tokens · 36485 ms · 2026-05-25T13:31:31.540450+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

[1]

Andrea Arcuri and Lionel Briand. 2014. https://doi.org/10.1002/stvr.1486 A hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering . Software Testing, Verification and Reliability, 24(3):219--250

work page doi:10.1002/stvr.1486 2014
[2]

Jean-Yves Audibert and S \'e bastien Bubeck. 2010. https://hal-enpc.archives-ouvertes.fr/hal-00654404/file/COLT10.pdf Best arm identification in multi-armed bandits . In COLT - 23th Conference on Learning Theory - 2010, pages 13--p

work page 2010
[3]

James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media

work page 2013
[4]

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. http://aclweb.org/anthology/P18-1128 The hitchhiker's guide to testing statistical significance in natural language processing . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383--1392. Association for Computational ...

work page 2018
[5]

Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press

work page 1994
[6]

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 2002. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pages 255--270. Springer

work page 2002
[7]

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Springer series in statistics New York, NY, USA:

work page 2001
[8]

Yarin Gal and Zoubin Ghahramani. 2016. http://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf A theoretically grounded application of dropout in recurrent neural networks . In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing System...

work page 2016
[9]

Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. http://aclweb.org/anthology/W18-2501 Allennlp: A deep semantic natural language processing platform . In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1--6. Association for Computational...

work page 2018
[10]

Aur \'e lien Garivier and Emilie Kaufmann. 2016. http://proceedings.mlr.press/v49/garivier16a.pdf Optimal best arm identification with fixed confidence . In Conference on Learning Theory, pages 998--1027

work page 2016
[11]

Gholamreza Haffari, Tuan Dung Tran, and Mark Carman. 2017. http://aclweb.org/anthology/E17-1039 Efficient benchmarking of nlp apis using multi-armed bandits . In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 408--416. Association for Computational Linguistics

work page 2017
[12]

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. https://arxiv.org/pdf/1207.0580.pdf Improving neural networks by preventing co-adaptation of feature detectors . arXiv preprint arXiv:1207.0580

work page internal anchor Pith review Pith/arXiv arXiv 2012
[13]

Junya Honda and Akimichi Takemura. 2014. http://proceedings.mlr.press/v33/honda14.pdf Optimality of thompson sampling for gaussian bandits depends on priors . In Artificial Intelligence and Statistics, pages 375--383

work page 2014
[14]

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. 2013. https://arxiv.org/pdf/1306.3917.pdf On finding the largest mean among many . arXiv preprint arXiv:1306.3917

work page internal anchor Pith review Pith/arXiv arXiv 2013
[15]

Kevin Jamieson, Matthew Malloy, Robert Nowak, and S \'e bastien Bubeck. 2014. http://proceedings.mlr.press/v35/jamieson14.pdf lil’ucb: An optimal exploration algorithm for multi-armed bandits . In Conference on Learning Theory, pages 423--439

work page 2014
[16]

Kevin Jamieson and Robert Nowak. 2014. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6814096 Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting . In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1--6. IEEE

work page 2014
[17]

Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnab \'a s P \'o czos. 2018. http://proceedings.mlr.press/v84/kandasamy18a/kandasamy18a.pdf Parallelised bayesian optimisation via thompson sampling . In International Conference on Artificial Intelligence and Statistics

work page 2018
[18]

Zohar Karnin, Tomer Koren, and Oren Somekh. 2013. http://proceedings.mlr.press/v28/karnin13.pdf Almost optimal exploration in multi-armed bandits . In International Conference on Machine Learning, pages 1238--1246

work page 2013
[19]

Diederik P Kingma and Jimmy Ba. 2014. https://arxiv.org/pdf/1412.6980.pdf Adam: A method for stochastic optimization . arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[20]

Ron Kohavi. 1995. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf A study of cross-validation and bootstrap for accuracy estimation and model selection . In Proceedings of the 14th international joint conference on Artificial intelligence-Volume 2, pages 1137--1143. Morgan Kaufmann Publishers Inc

work page 1995
[21]

Tze Leung Lai and Herbert Robbins. 1985. https://core.ac.uk/download/pdf/82425825.pdf Asymptotically efficient adaptive allocation rules . Advances in applied mathematics, 6(1):4--22

work page arXiv 1985
[22]

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. https://doi.org/10.18653/v1/N16-1030 Neural architectures for named entity recognition . In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260--270. Associa...

work page doi:10.18653/v1/n16-1030 2016
[23]

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. https://doi.org/10.18653/v1/D17-1272 Counterfactual learning from bandit feedback under deterministic logging : A case study in statistical machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566--2576. Association for Computation...

work page doi:10.18653/v1/d17-1272 2017
[24]

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. https://www.wwwconference.org/proceedings/www2010/www/p661.pdf A contextual-bandit approach to personalized news article recommendation . In Proceedings of the 19th international conference on World wide web, pages 661--670. ACM

work page 2010
[25]

Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. https://www.ijcai.org/proceedings/2017/0568.pdf Interactive attention networks for aspect-level sentiment classification . In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4068--4074. AAAI Press

work page 2017
[26]

Xuezhe Ma and Eduard Hovy. 2016. https://doi.org/10.18653/v1/P16-1101 End-to-end sequence labeling via bi-directional lstm-cnns-crf . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064--1074. Association for Computational Linguistics

work page doi:10.18653/v1/p16-1101 2016
[27]

Shie Mannor and John N Tsitsiklis. 2004. http://www.jmlr.org/papers/volume5/mannor04b/mannor04b.pdf The sample complexity of exploration in the multi-armed bandit problem . Journal of Machine Learning Research, 5(Jun):623--648

work page 2004
[28]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Distributed representations of words and phrases and their compositionality . In Advances in neural information processing systems, pages 3111--3119

work page 2013
[29]

Henry Moss, David Leslie, and Paul Rayson. 2018. http://aclweb.org/anthology/C18-1252 Using j-k-fold cross validation to reduce variance when tuning nlp models . In Proceedings of the 27th International Conference on Computational Linguistics, pages 2978--2989. Association for Computational Linguistics

work page 2018
[30]

Khanh Nguyen, Hal Daum \'e III, and Jordan Boyd-Graber. 2017. https://doi.org/10.18653/v1/D17-1153 Reinforcement learning for bandit neural machine translation with simulated human feedback . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1464--1474. Association for Computational Linguistics

work page doi:10.18653/v1/d17-1153 2017
[31]

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. https://openreview.net/pdf?id=BJJsrmfCZ Automatic differentiation in pytorch . In NIPS-W

work page 2017
[32]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. https://doi.org/10.3115/v1/D14-1162 Glove: Global vectors for word representation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543. Association for Computational Linguistics

work page doi:10.3115/v1/d14-1162 2014
[33]

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. https://doi.org/10.3115/v1/S14-2004 Semeval-2014 task 4: Aspect based sentiment analysis . In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27--35. Association for Computational Linguistics

work page doi:10.3115/v1/s14-2004 2014
[34]

Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338--348. Association for Computational Linguistics

work page doi:10.18653/v1/d17-1035 2017
[35]

Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/pdf/1803.09578.pdf Why comparing single performance scores does not allow to draw conclusions about machine learning approaches . arXiv preprint arXiv:1803.09578

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

Daniel Russo. 2016. http://proceedings.mlr.press/v49/russo16.pdf Simple bayesian algorithms for best arm identification . In Conference on Learning Theory, pages 1417--1418

work page 2016
[37]

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. https://djrusso.github.io/RLCourse/papers/TS_Tutorial.pdf A tutorial on thompson sampling . Foundations and Trends in Machine Learning, 11(1):1--96

work page 2018
[38]

SemEval. 2018. https://doi.org/10.18653/v1/S18-1 Proceedings of The 12th International Workshop on Semantic Evaluation . Association for Computational Linguistics, New Orleans, Louisiana

work page doi:10.18653/v1/s18-1 2018
[39]

Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016. https://doi.org/10.18653/v1/P16-1152 Learning structured predictors from bandit feedback for interactive nlp . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1610--1620. Association for Computational Linguistics

work page doi:10.18653/v1/p16-1152 2016
[40]

Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016. http://aclweb.org/anthology/C16-1311 Effective lstms for target-dependent sentiment classification . In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3298--3307. The COLING 2016 Organizing Committee

work page 2016
[41]

Sof \' a S Villar, Jack Bowden, and James Wason. 2015. https://doi.org/10.1214/14-STS504 Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges . Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199

work page doi:10.1214/14-sts504 2015
[42]

Yequan Wang, Minlie Huang, xiaoyan zhu, and Li Zhao. 2016. https://doi.org/10.18653/v1/D16-1058 Attention-based lstm for aspect-level sentiment classification . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606--615. Association for Computational Linguistics

work page doi:10.18653/v1/d16-1058 2016
[43]

Jie Yang, Shuailong Liang, and Yue Zhang. 2018. http://aclweb.org/anthology/C18-1327 Design challenges and misconceptions in neural sequence labeling . In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879--3889. Association for Computational Linguistics

work page 2018
[44]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[45]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Andrea Arcuri and Lionel Briand. 2014. https://doi.org/10.1002/stvr.1486 A hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering . Software Testing, Verification and Reliability, 24(3):219--250

work page doi:10.1002/stvr.1486 2014

[2] [2]

Jean-Yves Audibert and S \'e bastien Bubeck. 2010. https://hal-enpc.archives-ouvertes.fr/hal-00654404/file/COLT10.pdf Best arm identification in multi-armed bandits . In COLT - 23th Conference on Learning Theory - 2010, pages 13--p

work page 2010

[3] [3]

James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media

work page 2013

[4] [4]

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. http://aclweb.org/anthology/P18-1128 The hitchhiker's guide to testing statistical significance in natural language processing . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383--1392. Association for Computational ...

work page 2018

[5] [5]

Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press

work page 1994

[6] [6]

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 2002. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pages 255--270. Springer

work page 2002

[7] [7]

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Springer series in statistics New York, NY, USA:

work page 2001

[8] [8]

Yarin Gal and Zoubin Ghahramani. 2016. http://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf A theoretically grounded application of dropout in recurrent neural networks . In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing System...

work page 2016

[9] [9]

Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. http://aclweb.org/anthology/W18-2501 Allennlp: A deep semantic natural language processing platform . In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1--6. Association for Computational...

work page 2018

[10] [10]

Aur \'e lien Garivier and Emilie Kaufmann. 2016. http://proceedings.mlr.press/v49/garivier16a.pdf Optimal best arm identification with fixed confidence . In Conference on Learning Theory, pages 998--1027

work page 2016

[11] [11]

Gholamreza Haffari, Tuan Dung Tran, and Mark Carman. 2017. http://aclweb.org/anthology/E17-1039 Efficient benchmarking of nlp apis using multi-armed bandits . In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 408--416. Association for Computational Linguistics

work page 2017

[12] [12]

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. https://arxiv.org/pdf/1207.0580.pdf Improving neural networks by preventing co-adaptation of feature detectors . arXiv preprint arXiv:1207.0580

work page internal anchor Pith review Pith/arXiv arXiv 2012

[13] [13]

Junya Honda and Akimichi Takemura. 2014. http://proceedings.mlr.press/v33/honda14.pdf Optimality of thompson sampling for gaussian bandits depends on priors . In Artificial Intelligence and Statistics, pages 375--383

work page 2014

[14] [14]

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. 2013. https://arxiv.org/pdf/1306.3917.pdf On finding the largest mean among many . arXiv preprint arXiv:1306.3917

work page internal anchor Pith review Pith/arXiv arXiv 2013

[15] [15]

Kevin Jamieson, Matthew Malloy, Robert Nowak, and S \'e bastien Bubeck. 2014. http://proceedings.mlr.press/v35/jamieson14.pdf lil’ucb: An optimal exploration algorithm for multi-armed bandits . In Conference on Learning Theory, pages 423--439

work page 2014

[16] [16]

Kevin Jamieson and Robert Nowak. 2014. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6814096 Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting . In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1--6. IEEE

work page 2014

[17] [17]

Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnab \'a s P \'o czos. 2018. http://proceedings.mlr.press/v84/kandasamy18a/kandasamy18a.pdf Parallelised bayesian optimisation via thompson sampling . In International Conference on Artificial Intelligence and Statistics

work page 2018

[18] [18]

Zohar Karnin, Tomer Koren, and Oren Somekh. 2013. http://proceedings.mlr.press/v28/karnin13.pdf Almost optimal exploration in multi-armed bandits . In International Conference on Machine Learning, pages 1238--1246

work page 2013

[19] [19]

Diederik P Kingma and Jimmy Ba. 2014. https://arxiv.org/pdf/1412.6980.pdf Adam: A method for stochastic optimization . arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014

[20] [20]

Ron Kohavi. 1995. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf A study of cross-validation and bootstrap for accuracy estimation and model selection . In Proceedings of the 14th international joint conference on Artificial intelligence-Volume 2, pages 1137--1143. Morgan Kaufmann Publishers Inc

work page 1995

[21] [21]

Tze Leung Lai and Herbert Robbins. 1985. https://core.ac.uk/download/pdf/82425825.pdf Asymptotically efficient adaptive allocation rules . Advances in applied mathematics, 6(1):4--22

work page arXiv 1985

[22] [22]

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. https://doi.org/10.18653/v1/N16-1030 Neural architectures for named entity recognition . In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260--270. Associa...

work page doi:10.18653/v1/n16-1030 2016

[23] [23]

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. https://doi.org/10.18653/v1/D17-1272 Counterfactual learning from bandit feedback under deterministic logging : A case study in statistical machine translation . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566--2576. Association for Computation...

work page doi:10.18653/v1/d17-1272 2017

[24] [24]

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. https://www.wwwconference.org/proceedings/www2010/www/p661.pdf A contextual-bandit approach to personalized news article recommendation . In Proceedings of the 19th international conference on World wide web, pages 661--670. ACM

work page 2010

[25] [25]

Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. https://www.ijcai.org/proceedings/2017/0568.pdf Interactive attention networks for aspect-level sentiment classification . In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4068--4074. AAAI Press

work page 2017

[26] [26]

Xuezhe Ma and Eduard Hovy. 2016. https://doi.org/10.18653/v1/P16-1101 End-to-end sequence labeling via bi-directional lstm-cnns-crf . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064--1074. Association for Computational Linguistics

work page doi:10.18653/v1/p16-1101 2016

[27] [27]

Shie Mannor and John N Tsitsiklis. 2004. http://www.jmlr.org/papers/volume5/mannor04b/mannor04b.pdf The sample complexity of exploration in the multi-armed bandit problem . Journal of Machine Learning Research, 5(Jun):623--648

work page 2004

[28] [28]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Distributed representations of words and phrases and their compositionality . In Advances in neural information processing systems, pages 3111--3119

work page 2013

[29] [29]

Henry Moss, David Leslie, and Paul Rayson. 2018. http://aclweb.org/anthology/C18-1252 Using j-k-fold cross validation to reduce variance when tuning nlp models . In Proceedings of the 27th International Conference on Computational Linguistics, pages 2978--2989. Association for Computational Linguistics

work page 2018

[30] [30]

Khanh Nguyen, Hal Daum \'e III, and Jordan Boyd-Graber. 2017. https://doi.org/10.18653/v1/D17-1153 Reinforcement learning for bandit neural machine translation with simulated human feedback . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1464--1474. Association for Computational Linguistics

work page doi:10.18653/v1/d17-1153 2017

[31] [31]

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. https://openreview.net/pdf?id=BJJsrmfCZ Automatic differentiation in pytorch . In NIPS-W

work page 2017

[32] [32]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. https://doi.org/10.3115/v1/D14-1162 Glove: Global vectors for word representation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543. Association for Computational Linguistics

work page doi:10.3115/v1/d14-1162 2014

[33] [33]

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. https://doi.org/10.3115/v1/S14-2004 Semeval-2014 task 4: Aspect based sentiment analysis . In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27--35. Association for Computational Linguistics

work page doi:10.3115/v1/s14-2004 2014

[34] [34]

Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338--348. Association for Computational Linguistics

work page doi:10.18653/v1/d17-1035 2017

[35] [35]

Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/pdf/1803.09578.pdf Why comparing single performance scores does not allow to draw conclusions about machine learning approaches . arXiv preprint arXiv:1803.09578

work page internal anchor Pith review Pith/arXiv arXiv 2018

[36] [36]

Daniel Russo. 2016. http://proceedings.mlr.press/v49/russo16.pdf Simple bayesian algorithms for best arm identification . In Conference on Learning Theory, pages 1417--1418

work page 2016

[37] [37]

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. https://djrusso.github.io/RLCourse/papers/TS_Tutorial.pdf A tutorial on thompson sampling . Foundations and Trends in Machine Learning, 11(1):1--96

work page 2018

[38] [38]

SemEval. 2018. https://doi.org/10.18653/v1/S18-1 Proceedings of The 12th International Workshop on Semantic Evaluation . Association for Computational Linguistics, New Orleans, Louisiana

work page doi:10.18653/v1/s18-1 2018

[39] [39]

Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016. https://doi.org/10.18653/v1/P16-1152 Learning structured predictors from bandit feedback for interactive nlp . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1610--1620. Association for Computational Linguistics

work page doi:10.18653/v1/p16-1152 2016

[40] [40]

Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016. http://aclweb.org/anthology/C16-1311 Effective lstms for target-dependent sentiment classification . In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3298--3307. The COLING 2016 Organizing Committee

work page 2016

[41] [41]

Sof \' a S Villar, Jack Bowden, and James Wason. 2015. https://doi.org/10.1214/14-STS504 Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges . Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199

work page doi:10.1214/14-sts504 2015

[42] [42]

Yequan Wang, Minlie Huang, xiaoyan zhu, and Li Zhao. 2016. https://doi.org/10.18653/v1/D16-1058 Attention-based lstm for aspect-level sentiment classification . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606--615. Association for Computational Linguistics

work page doi:10.18653/v1/d16-1058 2016

[43] [43]

Jie Yang, Shuailong Liang, and Yue Zhang. 2018. http://aclweb.org/anthology/C18-1327 Design challenges and misconceptions in neural sequence labeling . In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879--3889. Association for Computational Linguistics

work page 2018

[44] [44]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[45] [45]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page