pith. sign in

arxiv: 2510.22819 · v2 · submitted 2025-10-26 · 💻 cs.LG

Last-Iterate Analyses of FTRL with the 1/2-Tsallis Entropy in Stochastic Bandits

Pith reviewed 2026-05-18 03:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords stochastic banditsFTRLTsallis entropylast-iterate convergenceBregman divergencesimple regretconvergence rate
0
0 comments X

The pith

The 1/2-Tsallis entropy regularizer in FTRL makes Bregman divergence to the optimal arm decay as t to the power of negative one-half.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the last-iterate behavior of Follow-the-Regularized-Leader with the 1/2-Tsallis entropy in stochastic multi-armed bandits. It shows that the Bregman divergence between the point mass on the optimal arm and the algorithm's distribution at time t shrinks proportionally to t to the negative one-half. This rate is slower than the t to the negative one that logarithmic regret might intuitively suggest, yet it gives a direct description of how decisions evolve toward optimality. A reader would care because last-iterate analysis tracks the actual sequence of choices rather than only average performance over time.

Core claim

The paper proves that the 1/2-Tsallis-INF algorithm produces distributions p_t such that the Bregman divergence D_Ψ(e_{i*}, p_t) decays at rate O(t^{-1/2}), where Ψ(p) = -4 sum sqrt(p_i) and i* denotes the optimal arm.

What carries the argument

The Bregman divergence induced by the 1/2-Tsallis entropy regularizer Ψ(p) = -4 sum sqrt(p_i), which tracks the distance from the current distribution to the optimal point mass.

If this is right

  • The probability placed on the optimal arm increases such that the divergence to optimality shrinks as the inverse square root of time.
  • This last-iterate rate holds without extra assumptions on reward gaps or variance.
  • The result supplies a precise path-wise view that complements the known logarithmic regret bound for the same algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap between the observed t to the negative one-half rate and the intuitively expected t to the negative one suggests that different regularizers may be needed for faster last-iterate guarantees.
  • The same divergence analysis technique could be applied to other entropy regularizers to compare their convergence paths in bandit settings.
  • Empirical checks on small bandit instances could verify whether the predicted square-root decay appears in finite samples.

Load-bearing premise

Reward distributions are fixed and unknown with independent draws each round, which lets the regularizer properties connect directly to the observed divergence decay.

What would settle it

Running the algorithm on a two-arm stochastic bandit with known rewards and plotting the measured Bregman divergence versus time; absence of decay proportional to one over square root of t would contradict the rate.

read the original abstract

The convergence analysis of online learning algorithms is central to machine learning theory, where the last-iterate convergence is particularly important, as it captures the learner's actual decisions and describes the evolution of the learning process over time. However, in multi-armed bandits, most existing algorithmic analyses mainly focus on the order of regret, while the last-iterate (simple regret) convergence rate remains less explored -- especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, FTRL with the $1/2$-Tsallis entropy regularizer $\Psi(p) = -4\sum_{i=1}^d \sqrt{p_i}$ (the $1/2$-Tsallis-INF algorithm, by arXiv:1807.07623) was shown to achieve logarithmic regret in stochastic bandits. Nevertheless, its last-iterate convergence rate has not yet been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper studies the $1/2$-Tsallis-INF algorithm and partially confirms this intuition through theoretical analysis, showing that the Bregman divergence, defined by $\Psi(p)$, between the point mass on the optimal arm and the probability distribution over the arm set obtained at iteration $t$, decays at a rate of $t^{-1/2}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper analyzes last-iterate convergence for the 1/2-Tsallis-INF algorithm (FTRL with regularizer Ψ(p) = -4 ∑ sqrt(p_i)) in stochastic multi-armed bandits. It claims that the Bregman divergence D_Ψ(e_{i*}, p_t) between the point mass on the optimal arm and the iterate distribution p_t decays at rate t^{-1/2}, partially addressing the intuition that logarithmic regret should imply a faster t^{-1} rate.

Significance. If the central rate holds, the result supplies a concrete last-iterate guarantee for an algorithm already known to achieve logarithmic regret, thereby linking cumulative performance to the evolution of the decision distribution itself. This is a useful addition to the literature on FTRL analyses in bandits.

major comments (1)
  1. The manuscript states that the analysis is performed under the standard stochastic MAB model (fixed unknown reward distributions, independent draws) without additional assumptions on reward gaps or variance. However, the claim that D_Ψ(e_{i*}, p_t) decays as t^{-1/2} presupposes a unique optimal arm i* and a strictly positive gap Δ > 0 that supplies consistent directional drift in the linear loss term. When Δ = 0 all arms are optimal, no unique e_{i*} exists, and p_t need not converge to any vertex; the derivation therefore cannot close without exploiting a positive gap. This assumption is load-bearing for the rate claim and must be stated explicitly or the theorem restated to cover the zero-gap case.
minor comments (1)
  1. The abstract notes that intuition suggested t^{-1} but the analysis yields t^{-1/2}; a short paragraph explaining the source of the slower rate (e.g., properties of the Bregman divergence or the specific regularizer) would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The single major comment is addressed below. We agree that the positive-gap assumption is implicit in the current presentation and will make it explicit.

read point-by-point responses
  1. Referee: The manuscript states that the analysis is performed under the standard stochastic MAB model (fixed unknown reward distributions, independent draws) without additional assumptions on reward gaps or variance. However, the claim that D_Ψ(e_{i*}, p_t) decays as t^{-1/2} presupposes a unique optimal arm i* and a strictly positive gap Δ > 0 that supplies consistent directional drift in the linear loss term. When Δ = 0 all arms are optimal, no unique e_{i*} exists, and p_t need not converge to any vertex; the derivation therefore cannot close without exploiting a positive gap. This assumption is load-bearing for the rate claim and must be stated explicitly or the theorem restated to cover the zero-gap case.

    Authors: We thank the referee for this observation. The analysis does rely on a unique optimal arm i* together with a strictly positive gap Δ > 0; this gap produces the persistent directional bias in the expected linear losses that drives D_Ψ(e_{i*}, p_t) to zero at rate t^{-1/2}. When all arms are optimal (Δ = 0), the target e_{i*} is not well-defined and convergence of p_t to a vertex need not occur. We will revise the manuscript to state the unique-optimal-arm and positive-gap assumptions explicitly in the problem setup, theorem statements, and discussion. The core last-iterate result and its proof remain unchanged under these clarified conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is independent theoretical analysis

full rationale

The paper presents a direct theoretical derivation of the last-iterate convergence rate for the 1/2-Tsallis-INF algorithm under the standard stochastic multi-armed bandit model. The claimed t^{-1/2} decay of the Bregman divergence D_Ψ(e_{i*}, p_t) follows from the FTRL update dynamics and the specific form of the regularizer Ψ(p) = -4 ∑ sqrt(p_i), without reducing any load-bearing step to a fitted parameter, self-referential definition, or unverified self-citation chain. The analysis cites prior work on the algorithm's regret properties but treats it as external input rather than deriving the new last-iterate result from it by construction. No equations or steps in the provided derivation chain exhibit the enumerated circularity patterns; the result has independent mathematical content relative to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on standard stochastic bandit assumptions rather than new free parameters or invented entities.

axioms (1)
  • domain assumption Rewards are drawn independently from fixed but unknown distributions for each arm.
    This stationarity is required for the regret-to-divergence relation to hold over time.

pith-pipeline@v0.9.0 · 5781 in / 1339 out tokens · 54365 ms · 2026-05-18T03:59:22.752763+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

    cs.LG 2026-05 unverdicted novelty 8.0

    With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.