Analysis of Search Heuristics in the Multi-Armed Bandit Setting
Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3
The pith
The (1+1) EA selects the Condorcet winner only with constant probability when its advantage is Omega(1/n) in dueling bandits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the binary stochastic dueling bandit setting with a Condorcet winner that beats every other arm with probability 1-p, the (1+1) EA chooses the Condorcet winner in its stationary distribution only with constant probability if p = Omega(1/n). By contrast, a simple EDA based on the Max-Min Ant System with iteration-best update will choose the Condorcet winner in its maintained distribution with probability 1-Theta(p). Repeated duels significantly boost the probability of the Condorcet winner in the stationary distribution of the (1+1) EA.
What carries the argument
The stationary distribution of the (1+1) EA and the maintained probability distribution of the EDA, both derived from fixed pairwise winning probabilities in the binary stochastic dueling model.
If this is right
- When the advantage p is only linear in 1/n the (1+1) EA will not converge to selecting the superior arm with probability approaching 1.
- The EDA reliably favors the Condorcet winner even when p is small.
- Allowing repeated comparisons inside the (1+1) EA raises its long-run selection probability for the winner.
- Search heuristics that rely on single pairwise comparisons can underperform in settings with probabilistic outcomes.
Where Pith is reading between the lines
- Distribution-based methods may be preferable to simple mutation-based EAs when feedback arrives only through noisy pairwise comparisons.
- The performance gap could guide algorithm choice in ranking or recommendation tasks that use dueling feedback.
- Variants of the (1+1) EA with larger populations might reduce the need for repeated duels.
- The same analysis framework could be applied to other heuristics that use tournament-style selection.
Load-bearing premise
A Condorcet winner exists whose winning probability against every other arm is a fixed value 1-p.
What would settle it
Simulate the (1+1) EA for large n on an instance where the best arm wins each duel with probability 1-1/n and measure whether its long-run selection frequency remains bounded below 1 by a positive constant independent of n.
Figures
read the original abstract
We consider the classic Multi-Armed Bandit setting to understand the exploration/exploitation tradeoffs made by different search heuristics. Since many search heuristics work by comparing different options (in evolutionary algorithms called "individuals"; in the Bandit literature called "arms"), we work with the "Dueling Bandits" setting. In each iteration, a comparison between different arms can be made; in the binary stochastic setting, each arm has a fixed winning probability against any other arm. A Condorcet winner is any arm that beats every other arm with a probability strictly higher than $1/2$. We show that evolutionary algorithms are rather bad at identifying the Condorcet winner: Even if the Condorcet winner beats every other arm with a probability $1-p$, the (1+1) EA, in its stationary distribution, chooses the Condorcet winner only with constant probability if $p=\Omega(1/n)$. By contrast, we show that a simple EDA (based on the Max-Min Ant System with iteration-best update) will choose the Condorcet winner in its maintained distribution with probability $1-\Theta(p)$. As a remedy for the (1+1) EA, we show how repeated duels can significantly boost the probability of the Condorcet winner in the stationary distribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes search heuristics in the dueling multi-armed bandit setting with a Condorcet winner. It claims that the (1+1) EA selects the Condorcet winner only with constant probability in its stationary distribution when the advantage parameter p is Ω(1/n), while a simple EDA based on the Max-Min Ant System with iteration-best update achieves selection probability 1-Θ(p). The paper also proposes repeated duels as a remedy to improve the EA's stationary probability.
Significance. If the results hold, the work is significant for evolutionary computation and bandit research by exposing limitations of standard (1+1) EAs under noisy pairwise comparisons and highlighting potential advantages of distribution-maintaining methods. It provides a theoretical lens on exploration/exploitation tradeoffs that could inform heuristic design, though the lack of mixing-time analysis weakens the practical interpretation of the stationary-distribution claims.
major comments (2)
- [Analysis of (1+1) EA stationary distribution] The central claim for the (1+1) EA (constant probability of selecting the Condorcet winner in stationary distribution when p=Ω(1/n)) relies on the Markov chain over arms with transition probabilities governed by fixed duel outcomes (Condorcet winner wins with probability 1-p). No mixing-time or relaxation-time bound is provided. If mixing time scales as Ω(1/p) or worse, the distribution may remain far from stationary after polynomially many iterations, undermining the relevance of the stationary probability as a performance metric.
- [Theoretical claims and derivations] The abstract and claims reference specific probability bounds and stationary distributions derived from the binary stochastic dueling model, but the manuscript provides no explicit proofs, error analysis, or verification of the Markov-chain stationary distributions. This gap is load-bearing for assessing whether the constant-probability claim for the EA and the 1-Θ(p) claim for the EDA are correct.
minor comments (1)
- [Introduction and model definition] The weakest assumption (binary stochastic dueling with fixed pairwise probabilities and existence of a Condorcet winner) should be stated more explicitly at the outset, including any restrictions on how p scales with n.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our analysis of search heuristics in the dueling bandit setting. The comments correctly identify gaps in our presentation of the stationary-distribution results. We respond to each major comment below and indicate the planned revisions.
read point-by-point responses
-
Referee: [Analysis of (1+1) EA stationary distribution] The central claim for the (1+1) EA (constant probability of selecting the Condorcet winner in stationary distribution when p=Ω(1/n)) relies on the Markov chain over arms with transition probabilities governed by fixed duel outcomes (Condorcet winner wins with probability 1-p). No mixing-time or relaxation-time bound is provided. If mixing time scales as Ω(1/p) or worse, the distribution may remain far from stationary after polynomially many iterations, undermining the relevance of the stationary probability as a performance metric.
Authors: We agree that the lack of a mixing-time or relaxation-time analysis limits the direct applicability of the stationary-distribution claims to finite-time performance. The stationary distribution is the natural long-run measure for the inherent selection bias of the (1+1) EA under the given duel probabilities, consistent with standard practice in theoretical evolutionary computation. However, without a bound, it is possible that the chain has not mixed after polynomially many steps when p = Ω(1/n). In the revised manuscript we will add an explicit discussion paragraph acknowledging this limitation, noting that the transition structure (Condorcet winner wins with probability 1-p) suggests a polynomial mixing time via standard drift or coupling arguments, while leaving a formal bound for future work. This constitutes a partial revision. revision: partial
-
Referee: [Theoretical claims and derivations] The abstract and claims reference specific probability bounds and stationary distributions derived from the binary stochastic dueling model, but the manuscript provides no explicit proofs, error analysis, or verification of the Markov-chain stationary distributions. This gap is load-bearing for assessing whether the constant-probability claim for the EA and the 1-Θ(p) claim for the EDA are correct.
Authors: The manuscript derives the stationary probabilities by solving the global balance equations of the respective Markov chains. For the (1+1) EA the state space is the current arm and the transition probabilities are exactly 1-p from any non-Condorcet arm to the Condorcet winner (and p in the reverse direction), yielding a closed-form stationary mass of Θ(1) on the Condorcet winner when p = Ω(1/n). For the Max-Min Ant System EDA the probability vector is updated by the iteration-best rule, producing the claimed 1-Θ(p) mass after a single update. We will include the complete step-by-step derivations, verification that the obtained vector satisfies πP = π, and a short error-bound discussion in an expanded appendix of the revised version, making the claims fully verifiable. revision: yes
Circularity Check
No circularity; stationary distributions derived directly from transition probabilities
full rationale
The paper models the (1+1) EA as a Markov chain on n arms with transitions determined by the fixed pairwise win probabilities (Condorcet winner wins with probability 1-p). The stationary distribution is obtained by solving the global balance equations under these transitions, yielding the claimed constant probability when p=Ω(1/n). The EDA result follows from its explicit per-iteration distribution update rule. No step reduces by construction to a fitted parameter, self-citation, or imported ansatz; all expressions are obtained from the standard dueling-bandit assumptions without circular redefinition or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Binary stochastic comparisons with fixed winning probabilities between every pair of arms
- domain assumption Existence of a Condorcet winner that beats all others with probability 1-p
Reference graph
Works this paper leans on
-
[1]
Arpit Agarwal, Nicholas Johnson, and Shivani Agarwal. 2020. Choice Bandits. InAdvances in Neural Information Processing Systems, H. Larochelle, M. Ran- zato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 18399–18410. https://proceedings.neurips.cc/paper_files/paper/2020/file/ d5fcc35c94879a4afad61cacca56192c-Paper.pdf
work page 2020
-
[2]
Denis Antipov, Benjamin Doerr, Jiefeng Fang, and Tangi Hetet. 2018. A tight runtime analysis for the (𝜇 + 𝜆) EA. InProceedings of the Genetic and Evolutionary Computation Conference(Kyoto, Japan)(GECCO ’18). Association for Computing Machinery, 1459–1466. doi:10.1145/3205455.3205627
-
[3]
Viktor Bengs, Robert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier
-
[4]
http://jmlr.org/papers/v22/18- 546.html
Preference-based Online Learning with Dueling Bandits: A Survey.Journal of Machine Learning Research22, 7 (2021), 1–108. http://jmlr.org/papers/v22/18- 546.html
work page 2021
-
[5]
Jasmin Brandt, Viktor Bengs, Björn Haddenhorst, and Eyke Hüllermeier. 2022. Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget. InAdvances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 20621–20634
work page 2022
-
[6]
Brian Brost, Yevgeny Seldin, Ingemar J. Cox, and Christina Lioma. 2016. Multi- Dueling Bandits and Their Application to Online Ranker Evaluation. InProceed- ings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM ’16). Association for Computing Machinery, 2161–2166. doi:10.1145/2983323.2983659
-
[7]
Brian Brost, Yevgeny Seldin, Ingemar J. Cox, and Christina Lioma. 2016. Multi- Dueling Bandits and Their Application to Online Ranker Evaluation. InProceed- ings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM ’16). Association for Computing Machinery, 2161–2166
work page 2016
-
[8]
Wei Chen, Yihan Du, Longbo Huang, and Haoyu Zhao. 2020. Combinatorial pure exploration for dueling bandits. InProceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 143, 11 pages
work page 2020
-
[9]
Benjamin Doerr, Daniel Johannsen, and Carola Winzen. 2012. Multiplicative Drift Analysis.Algorithmica64, 4, 673–697. doi:10.1007/S00453-012-9622-X
-
[10]
Benjamin Doerr, Bojana Kodric, and Marco Voigt. 2013. Lower bounds for the runtime of a global multi-objective evolutionary algorithm. InProceedings of the IEEE Congress on Evolutionary Computation (CEC ’13). Institute of Electrical and Electronics Engineers Inc., 432–439. doi:10.1109/CEC.2013.6557601
-
[11]
Benjamin Doerr and Frank Neumann. 2020. Theory of Evolutionary Computa- tion: Recent Developments in Discrete Optimization. InTheory of Evolutionary Computation: Recent Developments in Discrete Optimization, Benjamin Doerr and Frank Neumann (Eds.). Springer. doi:10.1007/978-3-030-29414-4
-
[12]
Tobias Friedrich, Timo Kötzing, Aneta Neumann, Frank Neumann, and Aish- warya Radhakrishnan. 2025. Analysis of the (1+1) EA on LeadingOnes with Constraints.Algorithmica87, 5, 661–689. doi:10.1007/S00453-025-01298-9
-
[13]
Björn Haddenhorst, Viktor Bengs, and Eyke Hüllermeier. 2021. Identifica- tion of the Generalized Condorcet Winner in Multi-Dueling Bandits. InAd- vances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 25904–25916. https://proceedings.neurips.cc/paper_files/paper/2021/file/ d9de6a144a3cc26cb4b3c47b206a121a-Paper.pdf
work page 2021
-
[14]
Jens Jägersküpper. 2008. A Blend of Markov-Chain and Drift Analysis. InParallel Problem Solving from Nature (PPSN ’08). Springer, 41–51. doi:10.1007/978-3-540- 87700-4_5
-
[15]
Thomas Jansen and Christine Zarges. 2011. On benefits and drawbacks of aging strategies for randomized search heuristics.Theoretical Computer Science412, 6 (2011), 543–559. doi:10.1016/j.tcs.2010.03.032
-
[16]
Martin S. Krejca. 2019.Theoretical analyses of univariate estimation-of-distribution algorithms. Doctoral Thesis. Universität Potsdam. doi:10.25932/publishup-43487
-
[17]
Martin S. Krejca and Carsten Witt. 2020. Theory of Estimation-of-Distribution Algorithms. InTheory of Evolutionary Computation - Recent Developments in Discrete Optimization. Springer, 405–442. doi:10.1007/978-3-030-29414-4_9
- [18]
-
[19]
Per Kristian Lehre and Xin Yao. 2012. On the Impact of Mutation-Selection Balance on the Runtime of Evolutionary Algorithms.IEEE Transactions on Evolutionary Computation16, 2 (2012), 225–241. doi:10.1109/TEVC.2011.2112665
-
[20]
R.D. Luce. 1959.Individual Choice Behavior: A Theoretical Analysis. Wiley. https://books.google.de/books?id=a80DAQAAIAAJ
work page 1959
-
[21]
Michael Mitzenmacher and Eli Upfal. 2017.Probability and Computing: Random- ization and Probabilistic Techniques in Algorithms and Data Analysis(2nd ed.). Cambridge University Press
work page 2017
-
[22]
Soheil Mohajer, Changho Suh, and Adel Elmahdy. 2017. Active Learning for Top-𝐾 Rank Aggregation from Noisy Comparisons. InProceedings of the 34th International Conference on Machine Learning, Vol. 70. Proceedings of Machine Learning Research, 2488–2497. https://proceedings.mlr.press/v70/mohajer17a. html
work page 2017
-
[23]
Pietro S. Oliveto and Carsten Witt. 2011. Simplified Drift Analysis for Proving Lower Bounds in Evolutionary Computation.Algorithmica59, 3 (01 Mar 2011), 369–386. doi:10.1007/s00453-010-9387-z
-
[24]
R. L. Plackett. 1975. The Analysis of Permutations.Journal of the Royal Statistical Society Series C24, 2 (June 1975), 193–202. doi:10.2307/2346567
-
[25]
Wenbo Ren, Jia Liu, and Ness Shroff. 2020. The Sample Complexity of Best-𝑘 Items Selection from Pairwise Comparisons. InProceedings of the 37th International Conference on Machine Learning, Vol. 119. Proceedings of Machine Learning Research, 8051–8072. https://proceedings.mlr.press/v119/ren20a.html
work page 2020
-
[26]
Wenbo Ren, Jia Liu, and Ness B. Shroff. 2019.On sample complexity upper and lower bounds for exact ranking from noisy comparisons. Curran Associates Inc
work page 2019
-
[27]
Günter Rudolph. 1998. Finite Markov Chain Results in Evolutionary Computation: A Tour d’Horizon.Fundamenta Informaticae35, 1–4 (Jan. 1998), 67–89
work page 1998
-
[28]
Aadirupa Saha and Aditya Gopalan. 2018. Battle of Bandits. InProceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 805–814. http://auai.org/uai2018/proceedings/papers/290.pdf
work page 2018
-
[29]
Aadirupa Saha and Aditya Gopalan. 2020. From PAC to Instance-Optimal Sample Complexity in the Plackett-Luce Model. InProceedings of the 37th International Conference on Machine Learning, Vol. 119. Proceedings of Machine Learning Research, 8367–8376. https://proceedings.mlr.press/v119/saha20b.html
work page 2020
-
[30]
Dirk Sudholt. 2009. The impact of parametrization in memetic evolutionary algorithms.Theoretical Computer Science410, 26 (June 2009), 2511–2528. doi:10. 1016/j.tcs.2009.03.003
work page 2009
-
[31]
Dirk Sudholt. 2011. Using markov-chain mixing time estimates for the analysis of ant colony optimization. InProceedings of the 11th Workshop Proceedings on Foundations of Genetic Algorithms (FOGA ’11). Association for Computing Machinery, 139–150. doi:10.1145/1967654.1967667
-
[32]
Carsten Witt. 2006. Runtime Analysis of the (𝜇+1) EA on Simple Pseudo-Boolean Functions.Evolutionary Computation14, 1 (March 2006), 65–86. doi:10.1162/ 106365606776022751
work page 2006
-
[33]
Carsten Witt. 2008. Population size versus runtime of a simple evolutionary algorithm.Theoretical Computer Science403, 1 (2008), 104–120. doi:10.1016/j.tcs. 2008.05.011 Analysis of Search Heuristics in the Multi-Armed Bandit Setting GECCO Companion ’26, July 13–17, 2026, San Jose, Costa Rica
-
[34]
Huangjun Zhu, Zihao Li, and Masahito Hayashi. 2022. Nearly tight universal bounds for the binomial tail probabilities. (2022). arXiv:2211.01688 https://arxiv. org/abs/2211.01688 Acknowledgements This research was (partially) funded by the HPI Research School on Foundations of AI (FAI). This research was supported by the Ministry of Culture and Science (NR...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.