With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
Last-Iterate Analyses of FTRL with the 1/2-Tsallis Entropy in Stochastic Bandits
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
The convergence analysis of online learning algorithms is central to machine learning theory, where the last-iterate convergence is particularly important, as it captures the learner's actual decisions and describes the evolution of the learning process over time. However, in multi-armed bandits, most existing algorithmic analyses mainly focus on the order of regret, while the last-iterate (simple regret) convergence rate remains less explored -- especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, FTRL with the $1/2$-Tsallis entropy regularizer $\Psi(p) = -4\sum_{i=1}^d \sqrt{p_i}$ (the $1/2$-Tsallis-INF algorithm, by arXiv:1807.07623) was shown to achieve logarithmic regret in stochastic bandits. Nevertheless, its last-iterate convergence rate has not yet been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper studies the $1/2$-Tsallis-INF algorithm and partially confirms this intuition through theoretical analysis, showing that the Bregman divergence, defined by $\Psi(p)$, between the point mass on the optimal arm and the probability distribution over the arm set obtained at iteration $t$, decays at a rate of $t^{-1/2}$.
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.