Regret Tail Characterization of Optimal Bandit Algorithms with Generic Rewards

Shubhada Agrawal; Subhodip Panda

arxiv: 2604.14876 · v1 · submitted 2026-04-16 · 💻 cs.IT · cs.LG· math.IT

Regret Tail Characterization of Optimal Bandit Algorithms with Generic Rewards

Subhodip Panda , Shubhada Agrawal This is my paper

Pith reviewed 2026-05-10 10:16 UTC · model grok-4.3

classification 💻 cs.IT cs.LGmath.IT

keywords multi-armed banditsregret tailsKLinf-UCBnonparametric rewardsasymptotic optimalitystochastic banditstail probabilitiesinformation divergence

0 comments

The pith

Extending KLinf-UCB to nonparametric rewards yields asymptotically optimal expected regret and tight bounds on regret tail probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the KLinf-UCB algorithm from parametric to a broad nonparametric class of reward distributions under mild assumptions. It proves that this version remains asymptotically optimal for expected regret. The work then derives a new upper bound on the probability of large regret and shows that the bound is tight for the special case of finitely supported rewards. The results recover earlier guarantees for bounded and heavy-tailed rewards as special cases. This matters because algorithms that look optimal on average can still produce unacceptably large regret with positive probability.

Core claim

The paper establishes that the KLinf-UCB algorithm, when extended to a broad nonparametric class of reward distributions satisfying mild assumptions, is asymptotically optimal in expectation. It further derives a novel upper bound on the regret tail probability. For finitely-supported reward distributions this upper bound matches the known lower bound exactly. The same bound recovers the existing tail guarantees for both bounded-support and moment-bounded heavy-tailed models, thereby giving a unified characterization of regret tails for asymptotically optimal KL-based UCB algorithms beyond parametric families.

What carries the argument

The KLinf-UCB algorithm, which uses an information-theoretic divergence (KL infimum) between empirical reward estimates and hypothesized distributions to decide which arm to pull next.

If this is right

The regret tail bound recovers the known guarantees for both bounded-support and heavy-tailed (moment-bounded) bandit models.
For finitely-supported rewards the derived upper bound on tail probability equals the existing lower bound.
Asymptotic optimality in expectation continues to hold for the entire nonparametric class.
The results supply a single tight characterization for all asymptotically optimal KL-based UCB algorithms outside parametric settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tail-analysis technique may extend to other expectation-optimal bandit methods once they are expressed in nonparametric form.
If the mild assumptions are satisfied by real data, the bound gives a practical way to limit the chance of catastrophic regret episodes.
The exact matching for finite-support cases indicates that no KL-based UCB algorithm can improve tail behavior without sacrificing expected optimality.

Load-bearing premise

The reward distributions belong to a broad nonparametric class that satisfies the paper's mild assumptions.

What would settle it

Finding any reward distribution inside the assumed nonparametric class for which the observed frequency of large regret exceeds the stated upper bound on the tail probability would falsify the claim.

read the original abstract

We study the tail behavior of regret in stochastic multi-armed bandits for algorithms that are asymptotically optimal in expectation. While minimizing expected regret is the classical objective, recent work shows that even such algorithms can exhibit heavy regret tails, incurring large regret with non-negligible probability. Existing sharp characterizations of regret tails are largely restricted to parametric settings, such as single-parameter exponential families. In this work, we extend the $\KLinf$-UCB algorithm of to a broad nonparametric class of reward distributions satisfying mild assumptions, and establish its asymptotic optimality in expectation. We then analyze the tail behavior of its regret and derive a novel upper bound on the regret tail probability. As special cases, our results recover regret-tail guarantees for both bounded-support and heavy-tailed (moment-bounded) bandit models. Moreover, for the special case of finitely-supported reward distributions, our upper bound matches the known lower bound exactly. Our results thus provide a unified and tight characterization of regret tails for asymptotically optimal KL-based UCB algorithms, going beyond parametric models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper cleanly extends KLinf-UCB to nonparametric rewards and nails an exact-matching tail bound for finite-support cases, but the heavy-tailed recovery rests on assumptions that may need more scrutiny than the abstract suggests.

read the letter

The main point is that this work takes the KLinf-UCB algorithm, moves it out of parametric exponential families, and shows both asymptotic optimality in expectation and a new upper bound on the probability of large regret. For finitely supported rewards the upper bound matches the known lower bound exactly, which is a tidy result. It also claims the same framework recovers the tail guarantees already known for bounded rewards and for moment-bounded heavy-tailed rewards under one set of mild nonparametric conditions.

Referee Report

2 major / 1 minor

Summary. The manuscript extends the KLinf-UCB algorithm to a broad nonparametric class of reward distributions under mild assumptions, proves its asymptotic optimality in expectation, and derives a novel upper bound on the regret tail probability. Special cases recover existing guarantees for bounded-support and moment-bounded heavy-tailed models; for finitely supported rewards the upper bound matches the known lower bound exactly.

Significance. If the central claims hold under the stated assumptions, the work supplies a unified nonparametric characterization of regret tails for asymptotically optimal KL-based UCB algorithms, with an exact match to the lower bound in the finite-support case. This strengthens the literature beyond parametric exponential-family settings.

major comments (2)

[§4 (tail bound proof)] §4 (or the section containing the tail-probability proof): the concentration inequality invoked for the regret-tail upper bound relies on exponential-moment or sub-exponential tail control of the empirical infimum estimator. The mild nonparametric assumptions (finite p-moments for p>1) do not automatically deliver this rate without explicit truncation or truncation-error bounds; the heavy-tailed recovery therefore appears to require stronger implicit conditions than those stated.
[§3 (class definition)] Definition of the nonparametric class (early in §3): the precise conditions under which the KLinf-UCB extension remains well-defined and the asymptotic optimality proof goes through are not stated with sufficient explicitness to verify that they suffice for both the expectation optimality and the tail bound simultaneously.

minor comments (1)

[§2-3] Notation for the generic reward class and the associated KLinf functional is introduced without a compact summary table; a single reference table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive evaluation of the work's significance. We address each of the major comments in detail below.

read point-by-point responses

Referee: [§4 (tail bound proof)] §4 (or the section containing the tail-probability proof): the concentration inequality invoked for the regret-tail upper bound relies on exponential-moment or sub-exponential tail control of the empirical infimum estimator. The mild nonparametric assumptions (finite p-moments for p>1) do not automatically deliver this rate without explicit truncation or truncation-error bounds; the heavy-tailed recovery therefore appears to require stronger implicit conditions than those stated.

Authors: We appreciate this observation regarding the tail bound proof. Upon re-examination, the proof in §4 does incorporate a truncation step to handle the finite p-moment assumptions and derive the necessary sub-exponential concentration for the empirical infimum. However, the truncation error analysis is not presented in sufficient detail. In the revised manuscript, we will expand §4 to include explicit bounds on the truncation error and demonstrate how the p-moment condition (p > 1) suffices to control these errors without requiring stronger assumptions. This will also clarify the recovery for heavy-tailed models. revision: yes
Referee: [§3 (class definition)] Definition of the nonparametric class (early in §3): the precise conditions under which the KLinf-UCB extension remains well-defined and the asymptotic optimality proof goes through are not stated with sufficient explicitness to verify that they suffice for both the expectation optimality and the tail bound simultaneously.

Authors: We agree that the definition of the nonparametric class in §3 could benefit from greater explicitness. The current statement lists the mild assumptions, but to facilitate verification, we will revise the definition to separately specify the conditions for (i) the well-definedness of the KLinf-UCB algorithm, (ii) the asymptotic optimality in expectation, and (iii) the regret tail bound. This will include precise statements on moment conditions and any regularity requirements, ensuring they are consistent across the analyses. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation extends prior algorithm and derives independent tail bound

full rationale

The paper extends the existing KLinf-UCB algorithm to a nonparametric reward class under stated mild assumptions, proves asymptotic optimality in expectation, and derives a new upper bound on regret tail probability. For finitely-supported rewards the bound is shown to match a known lower bound (a verification step, not a reduction). No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The heavy-tailed recovery is presented as a special case of the general derivation rather than an input. The chain is self-contained against external benchmarks and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; free parameters, axioms, and invented entities cannot be audited without the full manuscript.

pith-pipeline@v0.9.0 · 5476 in / 1166 out tokens · 12617 ms · 2026-05-10T10:16:19.814371+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

[1]

Agrawal, R. (1995). Sample mean based index policies with o (log n) regret for the multi-armed bandit problem. Advances in Applied Probability , pages 1054--1078

work page 1995
[2]

Agrawal, S. (2022). Bandits with Heavy Tails: Algorithms, Analysis & Optimality . PhD thesis, Tata Institute of Fundamental Research (TIFR), Mumbai, Mumbai, India. Doctoral thesis

work page 2022
[3]

and Goyal, N

Agrawal, S. and Goyal, N. (2012). Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory , pages 39--1. JMLR Workshop and Conference Proceedings

work page 2012
[4]

and Goyal, N

Agrawal, S. and Goyal, N. (2013). Further optimal regret bounds for thompson sampling. In Artificial intelligence and statistics , pages 99--107. PMLR

work page 2013
[5]

Agrawal, S., Juneja, S., and Glynn, P. (2020). Optimal -correct best-arm selection for heavy-tailed distributions. In Proceedings of the 31st International Conference on Algorithmic Learning Theory , volume 117 of Proceedings of Machine Learning Research , pages 61--110. PMLR

work page 2020
[6]

K., and Koolen, W

Agrawal, S., Juneja, S. K., and Koolen, W. M. (2021a). Regret minimization in heavy-tailed bandits. In Belkin, M. and Kpotufe, S., editors, Proceedings of the Thirty Fourth Conference on Learning Theory , volume 134 of Proceedings of Machine Learning Research , pages 26--62. PMLR

work page
[7]

M., and Juneja, S

Agrawal, S., Koolen, W. M., and Juneja, S. (2021b). Optimal best-arm identification methods for tail-risk measures. Advances in Neural Information Processing Systems , 34:25578--25590

work page
[8]

Ashutosh, K., Nair, J., Kagrecha, A., and Jagannathan, K. (2021). Bandit algorithms: Letting go of logarithmic regret for statistical robustness. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics , Proceedings of Machine Learning Research

work page 2021
[9]

Audibert, J.-Y., Munos, R., and Szepesv \'a ri, C. (2009). Exploration--exploitation tradeoff using variance estimates in multi--armed bandits. Theoretical Computer Science , 410(19):1876--1902

work page 2009
[10]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2):235--256

work page 2002
[11]

Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5(1):1--122

work page 2012
[12]

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tails. IEEE Transactions on Information Theory , 59(11):7711--7717

work page 2013
[13]

Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics , 17(2):122 -- 142

work page 1996
[14]

Capp \'e , O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback-- L eibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics , 41(3):1516--1541

work page 2013
[15]

Csisz \'a r, I. (2006). A simple proof of sanov’s theorem. Bulletin of the Brazilian Mathematical Society, New Series , 37(3):453--459

work page 2006
[16]

and Zeitouni, O

Dembo, A. and Zeitouni, O. (2010). Large deviations techniques and applications. corrected reprint of the second (1998) edition. stochastic modelling and applied probability, 38

work page 2010
[17]

and Glynn, P

Fan, L. and Glynn, P. W. (2022). The typical behavior of bandit algorithms. CoRR , abs/2210.05660

work page arXiv 2022
[18]

and Glynn, P

Fan, L. and Glynn, P. W. (2024). The fragility of optimized bandit algorithms. Operations Research

work page 2024
[19]

and Takemura, A

Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In In Proceedings of the Twenty-third Conference on Learning Theory (COLT 2010 , pages 67--79. Omnipress

work page 2010
[20]

and Takemura, A

Honda, J. and Takemura, A. (2011). An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning , 85(3):361--391

work page 2011
[21]

and Takemura, A

Honda, J. and Takemura, A. (2015). Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. The Journal of Machine Learning Research , 16(1):3721--3756

work page 2015
[22]

and Zeevi, A

Kalvit, A. and Zeevi, A. (2021). A closer look at the worst-case behavior of multi-armed bandit algorithms. In Advances in Neural Information Processing Systems , volume 34

work page 2021
[23]

Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory , pages 199--213. Springer

work page 2012
[24]

and Robbins, H

Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4 -- 22

work page 1985
[25]

Lai, T. L. (1987). Adaptive treatment allocation and the multi--armed bandit problem. The Annals of Statistics , 15(3):1091--1114

work page 1987
[26]

and Szepesv \'a ri, C

Lattimore, T. and Szepesv \'a ri, C. (2020). Bandit algorithms . Cambridge University Press

work page 2020
[27]

and Honda, J

Riou, C. and Honda, J. (2020). Bandit algorithms based on thompson sampling for bounded reward distributions. In Algorithmic Learning Theory , pages 777--826. PMLR

work page 2020
[28]

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. , 58(5):527--535

work page 1952
[29]

and Audibert, J.-Y

Salomon, A. and Audibert, J.-Y. (2011). Deviations of stochastic bandit regret. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT) , pages 159--173

work page 2011
[30]

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294

work page 1933

[1] [1]

Agrawal, R. (1995). Sample mean based index policies with o (log n) regret for the multi-armed bandit problem. Advances in Applied Probability , pages 1054--1078

work page 1995

[2] [2]

Agrawal, S. (2022). Bandits with Heavy Tails: Algorithms, Analysis & Optimality . PhD thesis, Tata Institute of Fundamental Research (TIFR), Mumbai, Mumbai, India. Doctoral thesis

work page 2022

[3] [3]

and Goyal, N

Agrawal, S. and Goyal, N. (2012). Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory , pages 39--1. JMLR Workshop and Conference Proceedings

work page 2012

[4] [4]

and Goyal, N

Agrawal, S. and Goyal, N. (2013). Further optimal regret bounds for thompson sampling. In Artificial intelligence and statistics , pages 99--107. PMLR

work page 2013

[5] [5]

Agrawal, S., Juneja, S., and Glynn, P. (2020). Optimal -correct best-arm selection for heavy-tailed distributions. In Proceedings of the 31st International Conference on Algorithmic Learning Theory , volume 117 of Proceedings of Machine Learning Research , pages 61--110. PMLR

work page 2020

[6] [6]

K., and Koolen, W

Agrawal, S., Juneja, S. K., and Koolen, W. M. (2021a). Regret minimization in heavy-tailed bandits. In Belkin, M. and Kpotufe, S., editors, Proceedings of the Thirty Fourth Conference on Learning Theory , volume 134 of Proceedings of Machine Learning Research , pages 26--62. PMLR

work page

[7] [7]

M., and Juneja, S

Agrawal, S., Koolen, W. M., and Juneja, S. (2021b). Optimal best-arm identification methods for tail-risk measures. Advances in Neural Information Processing Systems , 34:25578--25590

work page

[8] [8]

Ashutosh, K., Nair, J., Kagrecha, A., and Jagannathan, K. (2021). Bandit algorithms: Letting go of logarithmic regret for statistical robustness. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics , Proceedings of Machine Learning Research

work page 2021

[9] [9]

Audibert, J.-Y., Munos, R., and Szepesv \'a ri, C. (2009). Exploration--exploitation tradeoff using variance estimates in multi--armed bandits. Theoretical Computer Science , 410(19):1876--1902

work page 2009

[10] [10]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2):235--256

work page 2002

[11] [11]

Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5(1):1--122

work page 2012

[12] [12]

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tails. IEEE Transactions on Information Theory , 59(11):7711--7717

work page 2013

[13] [13]

Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics , 17(2):122 -- 142

work page 1996

[14] [14]

Capp \'e , O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback-- L eibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics , 41(3):1516--1541

work page 2013

[15] [15]

Csisz \'a r, I. (2006). A simple proof of sanov’s theorem. Bulletin of the Brazilian Mathematical Society, New Series , 37(3):453--459

work page 2006

[16] [16]

and Zeitouni, O

Dembo, A. and Zeitouni, O. (2010). Large deviations techniques and applications. corrected reprint of the second (1998) edition. stochastic modelling and applied probability, 38

work page 2010

[17] [17]

and Glynn, P

Fan, L. and Glynn, P. W. (2022). The typical behavior of bandit algorithms. CoRR , abs/2210.05660

work page arXiv 2022

[18] [18]

and Glynn, P

Fan, L. and Glynn, P. W. (2024). The fragility of optimized bandit algorithms. Operations Research

work page 2024

[19] [19]

and Takemura, A

Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In In Proceedings of the Twenty-third Conference on Learning Theory (COLT 2010 , pages 67--79. Omnipress

work page 2010

[20] [20]

and Takemura, A

Honda, J. and Takemura, A. (2011). An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning , 85(3):361--391

work page 2011

[21] [21]

and Takemura, A

Honda, J. and Takemura, A. (2015). Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. The Journal of Machine Learning Research , 16(1):3721--3756

work page 2015

[22] [22]

and Zeevi, A

Kalvit, A. and Zeevi, A. (2021). A closer look at the worst-case behavior of multi-armed bandit algorithms. In Advances in Neural Information Processing Systems , volume 34

work page 2021

[23] [23]

Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory , pages 199--213. Springer

work page 2012

[24] [24]

and Robbins, H

Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4 -- 22

work page 1985

[25] [25]

Lai, T. L. (1987). Adaptive treatment allocation and the multi--armed bandit problem. The Annals of Statistics , 15(3):1091--1114

work page 1987

[26] [26]

and Szepesv \'a ri, C

Lattimore, T. and Szepesv \'a ri, C. (2020). Bandit algorithms . Cambridge University Press

work page 2020

[27] [27]

and Honda, J

Riou, C. and Honda, J. (2020). Bandit algorithms based on thompson sampling for bounded reward distributions. In Algorithmic Learning Theory , pages 777--826. PMLR

work page 2020

[28] [28]

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. , 58(5):527--535

work page 1952

[29] [29]

and Audibert, J.-Y

Salomon, A. and Audibert, J.-Y. (2011). Deviations of stochastic bandit regret. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT) , pages 159--173

work page 2011

[30] [30]

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294

work page 1933