pith. sign in

arxiv: 2606.09191 · v1 · pith:Q7CJZPVNnew · submitted 2026-06-08 · 💻 cs.LG · stat.ML

Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards

Pith reviewed 2026-06-27 17:16 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords risk-averse banditsThompson Samplingasymptotic optimalitysub-Gaussian rewardscontinuous risk functionalsnonparametric algorithmsregret boundsCVaR
0
0 comments X

The pith

A nonparametric Thompson Sampling algorithm matches the instance-dependent regret lower bound for risk-averse bandits under any continuous risk functional on sub-Gaussian distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that its anchor-free nonparametric Thompson Sampling procedure attains regret matching the instance-dependent lower bound to leading order in log n. This asymptotic optimality result applies to any continuous risk functional on reward distributions with bounded density and sub-Gaussian tails, including common choices such as CVaR, mean-variance, Sharpe ratio, and distortion risk measures. The proof requires only continuity of the risk functional, a strictly weaker condition than the dominance or Lipschitz assumptions used in prior work, and it covers both bounded-support and sub-Gaussian settings. The technical advance consists of discretisation lemmas that project the Dirichlet posterior onto a fixed grid while keeping all polynomial prefactors of fixed degree independent of sample size. Readers would care because the result removes previous barriers to achieving instance-optimal performance in risk-sensitive sequential decision problems without imposing parametric reward forms.

Core claim

We prove that ρ-NPTS_SG achieves regret matching the instance-dependent lower bound to leading order in log n, establishing it as asymptotically optimal for any continuous risk functional ρ on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of ρ: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The key technical con

What carries the argument

The truncated discretisation lemma, which projects the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property while keeping all polynomial prefactors of fixed degree independent of sample size.

If this is right

  • The algorithm attains asymptotic optimality for CVaR, mean-variance, Sharpe ratio, and distortion risk measures on sub-Gaussian arms.
  • Instance-optimal guarantees now hold for non-Lipschitz risk functionals without requiring parametric reward distributions.
  • The same proof structure covers both the bounded-support case and the sub-Gaussian case.
  • Only continuity of ρ is needed, removing the need for dominance or Lipschitz conditions used in earlier analyses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuity-only condition may allow the same optimality result to apply to additional risk measures not previously analyzed.
  • The discretisation technique could be adapted to other nonparametric bandit settings where the support grows with the number of observations.
  • Practical risk-averse systems could adopt the algorithm directly without first verifying Lipschitz properties of their chosen risk measure.

Load-bearing premise

The risk functional must be continuous.

What would settle it

An explicit construction of a continuous risk functional and a family of sub-Gaussian distributions where the regret of ρ-NPTS_SG grows faster than the instance-dependent lower bound by more than lower-order terms in log n would falsify the asymptotic optimality claim.

Figures

Figures reproduced from arXiv: 2606.09191 by Joel Q. L. Chang.

Figure 1
Figure 1. Figure 1: Cumulative regret (mean ± 1-σ band, 200 trials) of ρ-NPTS (solid green) and Risk-LCB [Tan et al., 2022] (dashed orange, Experiment 1 only) against the asymptotic lower bound (dotted red, Theorem 1). All K = 2, n = 10,000. Experi￾ments 1 and 3 use Beta arms Experiment 2 uses TruncNorm arms. 0 2000 4000 6000 8000 10000 Rounds n 0 10 20 30 40 50 60 70 Cumulative regret Sharpe, Gaussian: N (1.5, 1) vs N (1.0, … view at source ↗
Figure 2
Figure 2. Figure 2: Experiment 4: Sharpe ratio on Gaussian arms [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

We prove that $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$, an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits, achieves regret matching the instance-dependent lower bound to leading order in $\log n$, establishing it as asymptotically optimal for any continuous risk functional $\rho$ (CVaR, mean-variance, Sharpe ratio, distortion risk measures, and more) on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of $\rho$: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The bounded-support case is developed first as a stepping stone sharing the same proof structure. The key technical contributions are a discretisation lemma (bounded support) and a truncated discretisation lemma (sub-Gaussian tails), each projecting the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property, holding all polynomial prefactors at fixed degree independent of sample size and breaking the super-exponential barrier that blocked prior proofs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proves that ρ-NPTS_SG, an anchor-free nonparametric Thompson Sampling algorithm, achieves regret matching the instance-dependent lower bound to leading order in log n for any continuous risk functional ρ (including CVaR, mean-variance, Sharpe ratio, and distortion risk measures) on the class of distributions with bounded density and sub-Gaussian tails. The bounded-support case is developed first, followed by the sub-Gaussian case; both rely only on continuity of ρ. The key technical tools are a discretization lemma (bounded support) and a truncated discretization lemma (sub-Gaussian tails) that project the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property while keeping all polynomial prefactors of bounded degree independent of n.

Significance. If the result holds, the work establishes the first instance-optimal guarantees for non-Lipschitz risk functionals such as the Sharpe ratio without parametric reward assumptions or dominance/Lipschitz conditions on ρ. The discretization lemmas that control prefactors independently of sample size and break the super-exponential barrier constitute a clear technical advance over prior parametric Thompson Sampling analyses.

minor comments (2)
  1. The abstract states that the bounded-support case is a 'stepping stone' sharing the same proof structure; a brief forward reference to the section containing the truncated discretization lemma would help readers track the shared steps.
  2. Notation for the algorithm (ρ-NPTS_SG) and the risk functional ρ is introduced in the abstract without an immediate pointer to the formal definition; adding a parenthetical reference to the relevant section would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, for highlighting the significance of the discretization lemmas and the instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio, and for recommending acceptance.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's derivation establishes asymptotic optimality via new discretization and truncated discretization lemmas that project the Dirichlet posterior onto a fixed grid using the Dirichlet aggregation property, with all polynomial prefactors of bounded degree independent of n. These lemmas are presented as original technical contributions that break the super-exponential barrier, directly supporting the regret bound matching the instance-dependent lower bound to leading order in log n under the sole assumption of continuity of ρ. No step reduces by construction to a fitted parameter, self-citation chain, ansatz smuggled via prior work, or renamed empirical pattern; the central claim retains independent content from the stated assumptions and lower bound.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on continuity of the risk functional and the sub-Gaussian plus bounded-density assumptions on rewards; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The risk functional ρ is continuous.
    Stated as the minimal condition sufficient for the asymptotic optimality result.
  • domain assumption Reward distributions have bounded density and sub-Gaussian tails.
    Defines the class of distributions for which the regret bound holds, including Gaussians.

pith-pipeline@v0.9.1-grok · 5747 in / 1311 out tokens · 23235 ms · 2026-06-27T17:16:12.144730+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 5 canonical work pages

  1. [1]

    Ang, Ming Liang and Lim, Eloise Y. Y. and Chang, Joel Q. L. , year =. Thompson Sampling for. 2105.06960 , archivePrefix=

  2. [2]

    Order Optimal Regret Bounds for

    Shah, Mohammad Taha and Khurshid, Sabrina and Ghatak, Gourab , year =. Order Optimal Regret Bounds for. 2508.13749 , archivePrefix=

  3. [3]

    Advances in Applied Mathematics , volume =

    Asymptotically efficient adaptive allocation rules , author =. Advances in Applied Mathematics , volume =

  4. [4]

    Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-ECAI) , year =

    A Survey of Risk-Aware Multi-Armed Bandits , author =. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-ECAI) , year =

  5. [5]

    Chang, Joel Q. L. and Tan, Vincent Y. F. , booktitle =. A Unifying Theory of. 2022 , volume =

  6. [6]

    A Bayesian Learning Automaton for Solving Two-Armed Bernoulli Bandit Problems , year=

    Granmo, Ole-Christoffer , booktitle=. A Bayesian Learning Automaton for Solving Two-Armed Bernoulli Bandit Problems , year=

  7. [7]

    2012 , eprint=

    Thompson Sampling: An Asymptotically Optimal Finite Time Analysis , author=. 2012 , eprint=

  8. [8]

    2021 , eprint=

    From Optimality to Robustness: Dirichlet Sampling Strategies in Stochastic Bandits , author=. 2021 , eprint=

  9. [9]

    and Wiley, J

    Huber, P.J. and Wiley, J. and InterScience, W. , biburl =

  10. [10]

    Halmos , title =

    Paul R. Halmos , title =. Bulletin of the American Mathematical Society , number =. 1959 , doi =

  11. [11]

    Theory of Probability and Its Applications , year=

    Convergence of Random Processes and Limit Theorems in Probability Theory , author=. Theory of Probability and Its Applications , year=

  12. [12]

    , journal=

    Posner, E. , journal=. Random coding strategies for minimum entropy , year=

  13. [13]

    Proceedings of the 31st Conference On Learning Theory , pages =

    A General Approach to Multi-Armed Bandits Under Risk Criteria , author =. Proceedings of the 31st Conference On Learning Theory , pages =. 2018 , editor =

  14. [14]

    1985 , issn =

    Asymptotically efficient adaptive allocation rules , journal =. 1985 , issn =. doi:https://doi.org/10.1016/0196-8858(85)90002-8 , url =

  15. [15]

    Proceedings of the 25th Annual Conference on Learning Theory , pages =

    Analysis of Thompson Sampling for the Multi-armed Bandit Problem , author =. Proceedings of the 25th Annual Conference on Learning Theory , pages =. 2012 , editor =

  16. [16]

    Learning Bounds for Risk-sensitive Learning , url =

    Lee, Jaeho and Park, Sejun and Shin, Jinwoo , booktitle =. Learning Bounds for Risk-sensitive Learning , url =

  17. [17]

    Continuous Mean-Covariance Bandits , year =

    Yihan Du and Siwei Wang and Zhixuan Fang and Longbo Huang , booktitle =. Continuous Mean-Covariance Bandits , year =

  18. [18]

    Mean-variance and value at risk in multi-armed bandit problems , year=

    Vakili, Sattar and Zhao, Qing , booktitle=. Mean-variance and value at risk in multi-armed bandit problems , year=

  19. [19]

    Proceedings of the 5th Asian Conference on Machine Learning , pages =

    Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits , author =. Proceedings of the 5th Asian Conference on Machine Learning , pages =. 2013 , editor =

  20. [20]

    2009 , publisher=

    Large Deviations Techniques and Applications , author=. 2009 , publisher=

  21. [21]

    Bernoulli , pages=

    A large-deviation principle for Dirichlet posteriors , author=. Bernoulli , pages=. 2000 , publisher=

  22. [22]

    Koolen and Sandeep Juneja , title =

    Shubhada Agrawal and Wouter M. Koolen and Sandeep Juneja , title =. CoRR , volume =. 2020 , url =

  23. [23]

    2021 , eprint=

    Off-Policy Risk Assessment in Contextual Bandits , author=. 2021 , eprint=

  24. [24]

    International Conference on Machine Learning , pages=

    Thompson Sampling Algorithms for Mean-Variance Bandits , author=. International Conference on Machine Learning , pages=

  25. [25]

    Advances in Neural Information Processing Systems , pages=

    Risk-aversion in multi-armed bandits , author=. Advances in Neural Information Processing Systems , pages=

  26. [26]

    Distributionally-Aware Exploration for

    Tamkin, Alex and Keramati, Ramtin and Dann, Christoph and Brunskill, Emma , booktitle=. Distributionally-Aware Exploration for

  27. [27]

    A revised approach for risk-averse multi-armed bandits under CVaR criterion , journal =

    Najakorn Khajonchotpanya and Yilin Xue and Napat Rujeerapaiboon , keywords =. A revised approach for risk-averse multi-armed bandits under CVaR criterion , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.orl.2021.05.005 , url =

  28. [28]

    Proceedings of the 38th International Conference on Machine Learning , pages =

    Optimal Thompson Sampling strategies for support-aware CVaR bandits , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

  29. [29]

    2021 , eprint=

    Risk-Constrained Thompson Sampling for CVaR Bandits , author=. 2021 , eprint=

  30. [30]

    Mathematical Finance , volume=

    Coherent measures of risk , author=. Mathematical Finance , volume=. 1999 , publisher=

  31. [31]

    Proceedings of the 31st International Conference on Algorithmic Learning Theory , pages =

    Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions , author =. Proceedings of the 31st International Conference on Algorithmic Learning Theory , pages =. 2020 , editor =

  32. [32]

    Premium Calculation by Transforming the Layer Premium Density , volume=

    Wang, Shaun , year=. Premium Calculation by Transforming the Layer Premium Density , volume=. ASTIN Bulletin , publisher=. doi:10.2143/AST.26.1.563234 , number=

  33. [33]

    Mathematics of Operations Research , volume =

    Garivier, Aur\'elien and M\'enard, Pierre and Stoltz, Gilles , title =. Mathematics of Operations Research , volume =. 2019 , doi =. https://doi.org/10.1287/moor.2017.0928 , abstract =

  34. [34]

    Biometrika , volume=

    On the likelihood that one unknown probability exceeds another in view of the evidence of two samples , author=. Biometrika , volume=. 1933 , publisher=

  35. [35]

    On the lookback distortion risk measure: theory and applications , volume =

    Hürlimann, Werner , year =. On the lookback distortion risk measure: theory and applications , volume =

  36. [36]

    Massart , title =

    P. Massart , title =. The Annals of Probability , number =. 1990 , doi =

  37. [37]

    Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education

    Clancey, William J. Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (IJCAI-83)

  38. [38]

    Classification Problem Solving

    Clancey, William J. Classification Problem Solving. Proceedings of the Fourth National Conference on Artificial Intelligence

  39. [39]

    , title =

    Robinson, Arthur L. , title =. 1980 , doi =. https://science.sciencemag.org/content/208/4447/1019.full.pdf , journal =

  40. [40]

    New Ways to Make Microcircuits Smaller---Duplicate Entry

    Robinson, Arthur L. New Ways to Make Microcircuits Smaller---Duplicate Entry. Science

  41. [41]

    Clancey and Glenn Rennels , abstract =

    Diane Warner Hasling and William J. Clancey and Glenn Rennels , abstract =. Strategic explanations for a diagnostic consultation system , journal =. 1984 , issn =. doi:https://doi.org/10.1016/S0020-7373(84)80003-6 , url =

  42. [42]

    and Rennels, Glenn R

    Hasling, Diane Warner and Clancey, William J. and Rennels, Glenn R. and Test, Thomas. Strategic Explanations in Consultation---Duplicate. The International Journal of Man-Machine Studies

  43. [43]

    Poligon: A System for Parallel Problem Solving

    Rice, James. Poligon: A System for Parallel Problem Solving

  44. [44]

    Transfer of Rule-Based Expertise through a Tutorial Dialogue

    Clancey, William J. Transfer of Rule-Based Expertise through a Tutorial Dialogue

  45. [45]

    The Engineering of Qualitative Models

    Clancey, William J. The Engineering of Qualitative Models

  46. [46]

    2008 , eprint=

    Crime and punishment in scientific research , author=. 2008 , eprint=

  47. [47]

    Pluto: The 'Other' Red Planet

    NASA. Pluto: The 'Other' Red Planet