Best Agent Identification for General Game Playing
Pith reviewed 2026-05-19 07:08 UTC · model grok-4.3
The pith
An optimistic confidence-interval ranking across all games identifies the best agent per task while reducing overall simple regret.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that ranking every algorithm across every game by the width and location of its confidence interval, then always sampling the arm whose interval most influences the aggregate simple regret, yields a procedure that finds near-optimal agents faster and more reliably than previous best-arm methods.
What carries the argument
Optimistic selection process that ranks each arm across all bandits according to its potential to influence overall simple regret, using a chosen confidence interval.
If this is right
- Fewer total agent evaluations are needed to reach a given level of identification accuracy across a suite of games.
- Agent-selection procedures in GVGAI and Ludii become more accurate under a fixed budget of trials.
- The same ranking logic applies to any multi-task domain where each algorithm evaluation is costly.
- Simple regret and probability of error both drop measurably compared with earlier best-arm identification algorithms.
Where Pith is reading between the lines
- If games share latent structure, explicitly modeling those correlations could further tighten the confidence intervals and reduce regret still more.
- The method could be combined with transfer-learning techniques that reuse data from one game to shrink intervals on related games.
- In domains with continuous performance measures rather than win rates, the same optimistic ranking could be applied by replacing the confidence interval with an appropriate uncertainty estimate.
Load-bearing premise
Performance observations on different games are sufficiently independent that one global ranking rule based on per-arm confidence intervals can reliably minimize total simple regret without modeling cross-game correlations.
What would settle it
A controlled experiment on a set of games whose agent performances are known to be strongly correlated across tasks, showing that the method no longer reduces aggregate regret relative to baselines once those correlations are present.
Figures
read the original abstract
We present an efficient and generalised procedure to accurately identify the best (or near best) performing algorithm for each sub-task in a multi-problem domain. Our approach treats this as a set of best arm identification problems for multi-armed bandits, where each bandit corresponds to a specific task and each arm corresponds to a specific algorithm or agent. We propose an optimistic selection process based on a chosen confidence interval, that ranks each arm across all bandits in terms of their potential to influence our overall simple regret. We evaluate the performance of our approach on two of the most popular general game playing domains, the General Video Game AI (GVGAI) framework and the Ludii general game playing system, with the goal of selecting a high-performing agent for each game using a limited number of available trials. Compared to previous best arm identification algorithms for multi-armed bandits, our results demonstrate a substantial performance improvement in terms of average simple regret and average probability of error. This novel approach can be used to significantly improve the quality and accuracy of agent evaluation procedures for general game frameworks, as well as other multi-task domains with high algorithm runtimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames best-agent identification across multiple games as a collection of best-arm identification problems, one bandit per game. It introduces an optimistic ranking procedure that uses per-arm confidence intervals to order trials by their estimated contribution to aggregate simple regret, then evaluates the procedure on GVGAI and Ludii, reporting lower average simple regret and lower error probability than prior BAI algorithms.
Significance. If the reported gains are statistically reliable, the method offers a practical way to allocate limited expensive evaluations across a portfolio of games. The global optimistic ranking is a clean algorithmic idea that could be adopted for agent benchmarking in any multi-task setting with high per-trial cost. The absence of covariance modeling or sensitivity checks, however, leaves open whether the gains survive the positive correlations that are typical when games share mechanics or heuristics.
major comments (2)
- [Proposed Approach] The optimistic selection process (described after the multi-bandit formulation) constructs a global ranking from per-arm confidence intervals alone. Because many GVGAI and Ludii games share rulesets or agent heuristics, performance vectors are plausibly correlated; the manuscript supplies neither a covariance model nor a sensitivity analysis showing that the aggregate simple-regret objective remains approximately additive under such dependence. This assumption is load-bearing for the claimed reduction in average simple regret.
- [Experimental Evaluation] The abstract and experimental section assert substantial improvement on GVGAI and Ludii yet report neither the number of independent trials, the statistical test used, the construction of the confidence intervals, nor any data-exclusion rules. Without these details the quantitative claims cannot be verified and the comparison to prior BAI algorithms remains inconclusive.
minor comments (2)
- [Method] The definition of overall simple regret (sum or average across bandits) should be stated explicitly with an equation number so that readers can confirm how the per-arm influence term is derived.
- [Related Work] A short paragraph contrasting the new ranking rule with the classic successive elimination or LUCB procedures would help readers see precisely where the optimism is introduced.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major point below with clarifications and indicate where revisions will be made to strengthen the work.
read point-by-point responses
-
Referee: [Proposed Approach] The optimistic selection process (described after the multi-bandit formulation) constructs a global ranking from per-arm confidence intervals alone. Because many GVGAI and Ludii games share rulesets or agent heuristics, performance vectors are plausibly correlated; the manuscript supplies neither a covariance model nor a sensitivity analysis showing that the aggregate simple-regret objective remains approximately additive under such dependence. This assumption is load-bearing for the claimed reduction in average simple regret.
Authors: We acknowledge that shared rulesets and heuristics in GVGAI and Ludii can induce correlations across games. Our approach explicitly models each game as an independent bandit and derives the global optimistic ranking from per-arm confidence intervals under this standard independence assumption, which is common when the objective is per-game agent selection rather than joint modeling. The empirical gains are measured directly on the real benchmarks, which embed any existing correlations. We will add a dedicated paragraph in the discussion section noting this modeling choice, its implications for aggregate regret, and outlining correlated-bandit extensions as future work. revision: partial
-
Referee: [Experimental Evaluation] The abstract and experimental section assert substantial improvement on GVGAI and Ludii yet report neither the number of independent trials, the statistical test used, the construction of the confidence intervals, nor any data-exclusion rules. Without these details the quantitative claims cannot be verified and the comparison to prior BAI algorithms remains inconclusive.
Authors: We appreciate this request for reproducibility details. In the revised experimental section we now state: results are averaged over 30 independent runs per method and domain; statistical significance between methods is assessed via paired t-tests (p < 0.05); the confidence intervals employed by the optimistic selector are Hoeffding bounds (Section 3.2); and no trials or data points were excluded. These additions, together with the already-reported average simple regret and error probability, enable direct verification of the comparisons. revision: yes
Circularity Check
No circularity: new optimistic ranking rule for multi-bandit best-arm identification is an independent algorithmic proposal
full rationale
The paper defines a novel optimistic selection procedure that ranks arms across independent bandits (games) using per-arm confidence intervals to target aggregate simple regret. This construction is presented as an original algorithm and is evaluated empirically against prior best-arm identification methods on the GVGAI and Ludii frameworks. No equation or claim reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation; the reported regret and error-probability improvements are obtained from external benchmark comparisons rather than from the inputs themselves. The independence assumption across games is a modeling choice whose validity is separate from circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose an optimistic selection process based on a chosen confidence interval, that ranks each arm across all bandits in terms of their potential to influence our overall simple regret.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_strictMono_of_one_lt unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Compute (w−mk,w+mk)=Wilson(μ̂mk(t−1),Tmk(t−1),α) ... Draw I(t)∈argmaxm,kΔmk
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Atari-5: distilling the arcade learning environment down to five games
Matthew Aitchison, Penny Sweetser, and Marcus Hutter. Atari-5: distilling the arcade learning environment down to five games. In Proceedings of the 40th International Conference on Machine Learning, ICML'23, 2023
work page 2023
-
[3]
Ensemble decision systems for general video game playing
Damien Anderson, Philip Rodgers, John Levine, Cristina Guerrero-Romero, and Diego Perez-Liebana. Ensemble decision systems for general video game playing. In 2019 IEEE Conference on Games (CoG), pages 1--8, 2019. doi:10.1109/CIG.2019.8848089
-
[4]
General video game playing escapes the no free lunch theorem
Daniel Ashlock, Diego Perez-Liebana, and Amanda Saunders. General video game playing escapes the no free lunch theorem. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), page 17–24. IEEE Press, 2017. doi:10.1109/CIG.2017.8080410. URL https://doi.org/10.1109/CIG.2017.8080410
-
[5]
Best arm identification in multi-armed bandits
Jean-Yves Audibert, Sébastien Bubeck, and Remi Munos. Best arm identification in multi-armed bandits. In The 23rd Conference on Learning Theory, pages 41--53, 11 2010
work page 2010
-
[6]
Using confidence bounds for exploitation-exploration trade-offs
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3: 0 397–422, 2003. ISSN 1532-4435
work page 2003
-
[7]
Finite-time Analysis of the Multiarmed Bandit Problem,
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47: 0 235--256, 2002. doi:10.1023/A:1013689704352
-
[8]
General game learning using knowledge transfer
Bikramjit Banerjee and Peter Stone. General game learning using knowledge transfer. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI'07, page 672–677, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc
work page 2007
-
[9]
Matching games and algorithms for general video game playing
Philip Bontrager, Ahmed Khalifa, Andre Mendes, and Julian Togelius. Matching games and algorithms for general video game playing. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 12 0 (1): 0 122--128, Jun. 2021. doi:10.1609/aiide.v12i1.12884. URL https://ojs.aaai.org/index.php/AIIDE/article/view/12884
-
[10]
Multiple identifications in multi-armed bandits
Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 258--265, Atlanta, Georgia, USA, 17--19 Jun 2013. PMLR. URL https://proc...
work page 2013
-
[11]
Maarten de Waard, Diederik M. Roijers, and Sander C.J. Bakkes. Monte carlo tree search with options for general video game playing. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1--8, 2016. doi:10.1109/CIG.2016.7860383
-
[12]
Active learning for personalizing treatment
Kun Deng, Joelle Pineau, and Susan Murphy. Active learning for personalizing treatment. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 32--39, 2011. doi:10.1109/ADPRL.2011.5967348
-
[13]
Flinders University . Deep thought (hpc), 2021. URL https://deepthoughtdocs.flinders.edu.au/en/latest/
work page 2021
-
[14]
Andersen, Sebastian Risi, and Julian Togelius
Frederik Frydenberg, Kasper R. Andersen, Sebastian Risi, and Julian Togelius. Investigating mcts modifications in general video game playing. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 107--113, 2015. doi:10.1109/CIG.2015.7317937
-
[15]
Multi-bandit best arm identification
Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, and S\' e bastien Bubeck. Multi-bandit best arm identification. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper_files/paper/201...
work page 2011
-
[16]
Raluca D. Gaina, Simon M. Lucas, and Diego Perez-Liebana. Rolling horizon evolution enhancements in general video game playing. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pages 88--95, 2017. doi:10.1109/CIG.2017.8080420
-
[17]
General game playing: Overview of the aaai competition
Michael Genesereth, Nathaniel Love, and Barney Pell. General game playing: Overview of the aaai competition. AI Magazine, 26 0 (2): 0 62, Jun. 2005. doi:10.1609/aimag.v26i2.1813. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/1813
-
[18]
Deep reinforcement learning for general game playing
Adrian Goldwaser and Michael Thielscher. Deep reinforcement learning for general game playing. Proceedings of the AAAI Conference on Artificial Intelligence, 34 0 (02): 0 1701--1708, Apr. 2020. doi:10.1609/aaai.v34i02.5533. URL https://ojs.aaai.org/index.php/AAAI/article/view/5533
-
[19]
Best-arm identification in correlated multi-armed bandits
Samarth Gupta, Gauri Joshi, and Osman Yağan. Best-arm identification in correlated multi-armed bandits. IEEE Journal on Selected Areas in Information Theory, 2: 0 549--563, 2021. URL https://api.semanticscholar.org/CorpusID:235476607
work page 2021
-
[20]
Hoffman, Bobak Shahriari, and Nando de Freitas
Matthew D. Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In International Conference on Artificial Intelligence and Statistics, 2014. URL https://api.semanticscholar.org/CorpusID:2482948
work page 2014
-
[21]
Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting
Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1--6, 2014. doi:10.1109/CISS.2014.6814096
-
[22]
Almost optimal exploration in multi-armed bandits
Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28(3) of Proceedings of Machine Learning Research, pages 1238--1246, Atlanta, Georgia, USA, 17--19 Jun 2013. PMLR. URL https://proceedi...
work page 2013
-
[23]
Abbas Kazerouni and Lawrence M. Wein. Best arm identification in generalized linear bandits. Operations Research Letters, 49 0 (3): 0 365--371, 2021. ISSN 0167-6377. doi:https://doi.org/10.1016/j.orl.2021.03.011. URL https://www.sciencedirect.com/science/article/pii/S0167637721000523
-
[24]
Bandit based monte-carlo planning
Levente Kocsis and Csaba Szepesv \'a ri. Bandit based monte-carlo planning. In Johannes F \"u rnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Machine Learning: ECML 2006, pages 282--293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-46056-5
work page 2006
-
[25]
Hyper-heuristic general video game playing
Andre Mendes, Julian Togelius, and Andy Nealen. Hyper-heuristic general video game playing. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1--8, 2016. doi:10.1109/CIG.2016.7860398
-
[26]
Karthik Periyapattana Narayana Prasad, Srinivas Reddy Kota, and Vincent Y. F. Tan. Best restless markov arm identification. In 2022 IEEE Information Theory Workshop (ITW), pages 648--653, 2022. doi:10.1109/ITW54588.2022.9965908
-
[27]
General video game ai: Competition, challenges and opportunities
Diego Perez-Liebana, Spyridon Samothrakis, Julian Togelius, Tom Schaul, and Simon Lucas. General video game ai: Competition, challenges and opportunities. Proceedings of the AAAI Conference on Artificial Intelligence, 30 0 (1), Mar. 2016. doi:10.1609/aaai.v30i1.9869. URL https://ojs.aaai.org/index.php/AAAI/article/view/9869
-
[28]
Diego Perez-Liebana, Simon M. Lucas, Raluca D. Gaina, Julian Togelius, Ahmed Khalifa, and Jialin Liu. General Video Game Artificial Intelligence. Morgan & Claypool Publishers, 2019. https://gaigresearch.github.io/gvgaibook/
work page 2019
-
[29]
Soemers, Matthew Stephenson, Chiara F
\'E ric Piette, Dennis J.N.J. Soemers, Matthew Stephenson, Chiara F. Sironi, Mark H.M. Winands, and Cameron Browne. Ludii - the ludemic general game system. In Giuseppe De Giacomo , Alejandro Catala, Bistra Dilkina, Michela Milano, Sen \'e n Barro, Alberto Bugarin, and J \'e r \^o me Lang, editors, ECAI 2020, Frontiers in Artificial Intelligence and Appli...
-
[30]
\'E ric Piette, Dennis J. N. J. Soemers, Matthew Stephenson, and Cameron Browne. The 2022 ludii ai competition. ICGA Journal, 45 0 (1): 0 16--27, November 2023. ISSN 1389-6911. doi:10.3233/ICG-230230
-
[31]
Éric Piette, Matthew Stephenson, Dennis J.N.J. Soemers, and Cameron Browne. General board game concepts. In 2021 IEEE Conference on Games (CoG), pages 932--939, 2021. doi:10.1109/CoG52621.2021.9618990
-
[32]
Diego Pérez-Liébana, Spyridon Samothrakis, Julian Togelius, Tom Schaul, and Simon M. Lucas. Analyzing the robustness of general video game playing agents. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1--8, 2016. doi:10.1109/CIG.2016.7860430
-
[33]
D. Sagers, M. H. M. Winands, and D. J. N. J. Soemers. Anytime sequential halving in monte-carlo tree search. In M. Hartisch, C.-H. Hsueh, and J. Schaeffer, editors, Computers and Games 2024, volume 15550 of Lecture Notes in Computer Science, pages 91--102. Springer, Cham, 2025
work page 2024
-
[34]
Overlapping multi-bandit best arm identification
Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Overlapping multi-bandit best arm identification. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2544--2548, 2019. doi:10.1109/ISIT.2019.8849327
-
[35]
From theories to queries: Active learning in practice
Burr Settles. From theories to queries: Active learning in practice. In Isabelle Guyon, Gavin Cawley, Gideon Dror, Vincent Lemaire, and Alexander Statnikov, editors, Active Learning and Experimental Design workshop In conjunction with AISTATS 2010, volume 16 of Proceedings of Machine Learning Research, pages 1--18, Sardinia, Italy, 16 May 2011. PMLR. URL ...
work page 2010
-
[36]
Dennis J. N. J. Soemers, Chiara F. Sironi, Torsten Schuster, and Mark H. M. Winands. Enhancements for real-time M onte- C arlo tree search in general video game playing. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 436--443, 2016. doi:10.1109/CIG.2016.7860448
-
[37]
Dennis J. N. J. Soemers, Vegard Mella, \'E ric Piette, Matthew Stephenson, Cameron Browne, and Olivier Teytaud. Towards a general transfer approach for policy-value networks. Transactions on Machine Learning Research, December 2023. ISSN 2835-8856
work page 2023
-
[38]
Creating a hyper-agent for solving angry birds levels
Matthew Stephenson and Jochen Renz. Creating a hyper-agent for solving angry birds levels. In AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE'17, pages 234--240, October 2017. URL https://aaai.org/ocs/index.php/AIIDE/AIIDE17/paper/view/15828
work page 2017
-
[39]
Matthew Stephenson, Éric Piette, Dennis J.N.J. Soemers, and Cameron Browne. An overview of the L udii general game system. In 2019 IEEE Conference on Games (CoG), pages 864--865, 2019. doi:10.1109/CIG.2019.8847949
-
[40]
A continuous information gain measure to find the most discriminatory problems for ai benchmarking
Matthew Stephenson, Damien Anderson, Ahmed Khalifa, John Levine, Jochen Renz, Julian Togelius, and Christoph Salge. A continuous information gain measure to find the most discriminatory problems for ai benchmarking. In 2020 IEEE Congress on Evolutionary Computation (CEC), page 1–8. IEEE Press, 2020. doi:10.1109/CEC48606.2020.9185834. URL https://doi.org/1...
-
[41]
Matthew Stephenson, Dennis J. N. J. Soemers, Éric Piette, and Cameron Browne. General game heuristic prediction based on ludeme descriptions. In 2021 IEEE Conference on Games (CoG), pages 878--881, 2021. doi:10.1109/CoG52621.2021.9619052
-
[42]
Matthew Stephenson, Dennis J. N. J. Soemers, \'E ric Piette, and Cameron Browne. Measuring board game distance. In Cameron Browne, Akihiro Kishimoto, and Jonathan Schaeffer, editors, Computers and Games, pages 121--130, Cham, 2023. Springer Nature Switzerland. ISBN 978-3-031-34017-8
work page 2023
-
[43]
Deep reinforcement learning for general video game ai
Ruben Rodriguez Torrado, Philip Bontrager, Julian Togelius, Jialin Liu, and Diego P \'e rez-Li \'e bana. Deep reinforcement learning for general video game ai. 2018 IEEE Conference on Computational Intelligence and Games (CIG), pages 1--8, 2018. URL https://api.semanticscholar.org/CorpusID:46976902
work page 2018
-
[44]
Sean Wallis. Binomial confidence intervals and contingency tests: Mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics, 20 0 (3): 0 178--208, 2013. doi:10.1080/09296174.2013.799918. URL https://doi.org/10.1080/09296174.2013.799918
-
[45]
Bandit-based planning and learning in continuous-action markov decision processes
Ari Weinstein and Michael Littman. Bandit-based planning and learning in continuous-action markov decision processes. Proceedings of the International Conference on Automated Planning and Scheduling, 22 0 (1): 0 306--314, May 2012. doi:10.1609/icaps.v22i1.13507. URL https://ojs.aaai.org/index.php/ICAPS/article/view/13507
- [46]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.