pith. sign in

arxiv: 2605.23351 · v1 · pith:WVRCR5ZRnew · submitted 2026-05-22 · 💻 cs.LG · cs.GT

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

Pith reviewed 2026-05-25 05:25 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords adversarial banditsdelayed feedbacksafety-aware regretonline mirror descentphased aggressionregret lower bounds
0
0 comments X

The pith

Prudent-Banker achieves minimax-optimal regret in adversarial bandits with delays while keeping constant regret to a safe baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Prudent-Banker for adversarial multi-armed bandits with and without delayed feedback. It targets a safety-aware goal of minimax-optimal worst-case regret paired with nearly constant regret relative to a designated safe baseline policy. Delays risk mistiming the switch between conservative and exploratory phases in prior methods. The algorithm pairs a delay-adapted Online Mirror Descent with a modified phased-aggression scheme whose restart threshold is calibrated to the maximum possible distortion from missing observations. This yields the claimed pseudo-regret of order square root of T plus square root of D together with constant regret against the safe comparator, and the bounds match new lower bounds established in the work.

Core claim

Prudent-Banker is the first algorithm to achieve the optimal safety-robustness trade-off: pseudo-regret Õ(√T + √D) together with Õ(1) regret against the safe comparator, both with and without delays. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. The paper also proves matching lower bounds showing these guarantees are unimprovable up to logarithmic factors under the baseline-safety requirement.

What carries the argument

The delay-calibrated restart threshold inside the phased-aggression mechanism, which detects when the safe comparator is suboptimal while compensating for the worst-case effect of unobserved delayed feedback.

If this is right

  • The same regret guarantees apply whether feedback is delayed or immediate.
  • The bounds are tight up to logarithmic factors under the baseline-safety requirement.
  • The algorithm balances safety and learning across diverse delay distributions where standard delay-robust baselines do not.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same restart-calibration idea could be tested in other partial-feedback settings where timing of information affects safety constraints.
  • If the threshold construction generalizes, safety constraints may be incorporable into many existing online algorithms without inflating the leading regret term.

Load-bearing premise

The delay-calibrated restart threshold accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality.

What would settle it

A sequence of delayed adversarial bandit instances in which Prudent-Banker incurs more than constant regret against the safe comparator would show the main guarantee does not hold.

Figures

Figures reproduced from arXiv: 2605.23351 by Emmanouil-Vasileios Vlatakis-Gkaragkounis, Luanda Cai, Ting Hu.

Figure 1
Figure 1. Figure 1: Dynamic performance of PRUDENT-BANKER versus delay-aware safety baselines. Top: cumulative pseudo-loss minus that of the hindsight best fixed arm i ⋆ . Middle: PRUDENT-BANKER cumulative loss minus comparator loss Lc (negative values mean PRUDENT-BANKER beats x c ). Bottom: the PRUDENT-BANKER mixing coefficient αt. Under delay, PRUDENT-BANKER repeatedly performs conservative restarts and only re-enters aggr… view at source ↗
Figure 2
Figure 2. Figure 2: Dynamic performance against the best-fixed-arm comparator. The top row plots cumulative [PITH_FULL_IMAGE:figures/full_fig_p047_2.png] view at source ↗
read the original abstract

We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety--robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Prudent-Banker, an algorithm for adversarial multi-armed bandits (with and without delays) that combines a delay-adapted Online Mirror Descent with a modified phased-aggression mechanism. Its central claim is that a delay-calibrated restart threshold enables the optimal safety-robustness trade-off: pseudo-regret Õ(√T + √D) together with Õ(1) regret against a designated safe baseline comparator. The paper also derives new lower bounds showing these rates are unimprovable (up to log factors) under the baseline-safety constraint, and reports experiments across delay distributions.

Significance. If the central claims hold, the work would be significant for online learning: it is presented as the first algorithm to achieve the stated optimal trade-off simultaneously with and without delays, backed by matching lower bounds. The explicit handling of arbitrary delays while preserving the Õ(1) safety guarantee against the baseline addresses a practical gap in existing delay-robust methods. The experiments provide initial evidence of practical balance between safety and learning.

major comments (2)
  1. [Abstract (key technical contribution paragraph)] The central claim rests on the delay-calibrated restart threshold correctly bounding worst-case distortion from unobserved feedback while detecting comparator suboptimality without inflating the Õ(1) safety regret. The abstract states this property but the provided text does not contain the derivation or proof; this is load-bearing and requires explicit verification in §4 or the appendix before the optimality claim can be assessed.
  2. [Lower bounds (referenced in abstract)] The new lower bounds for safety-constrained adversarial delayed bandits are invoked to show unimprovability up to logs, but the visible text gives no construction details, assumptions on the safe comparator, or explicit comparison to the upper bound constants. This must be checked against the upper-bound analysis to confirm the rates truly match.
minor comments (2)
  1. [Experiments] The experimental section should specify the exact delay distributions, number of runs, and baseline implementations to support reproducibility claims.
  2. [Preliminaries] Notation for the restart threshold and the safe comparator should be defined consistently in the main text before first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the two major comments point by point below, clarifying the locations of the relevant proofs and constructions while indicating revisions to improve accessibility.

read point-by-point responses
  1. Referee: [Abstract (key technical contribution paragraph)] The central claim rests on the delay-calibrated restart threshold correctly bounding worst-case distortion from unobserved feedback while detecting comparator suboptimality without inflating the Õ(1) safety regret. The abstract states this property but the provided text does not contain the derivation or proof; this is load-bearing and requires explicit verification in §4 or the appendix before the optimality claim can be assessed.

    Authors: The derivation of the delay-calibrated restart threshold, including its bounding of worst-case distortion from unobserved feedback and preservation of the Õ(1) safety regret, appears in Section 4 (Theorems 4.1–4.2 and Lemmas 4.3–4.5). The complete proof is in Appendix B. To make this more immediately verifiable, we will insert an explicit pointer to these results in the abstract and the opening of Section 4 in the revised version. revision: yes

  2. Referee: [Lower bounds (referenced in abstract)] The new lower bounds for safety-constrained adversarial delayed bandits are invoked to show unimprovability up to logs, but the visible text gives no construction details, assumptions on the safe comparator, or explicit comparison to the upper bound constants. This must be checked against the upper-bound analysis to confirm the rates truly match.

    Authors: Section 5 presents the lower-bound construction for the safety-constrained setting (two-arm adversarial instance with arbitrary delays). The safe comparator is defined as the best fixed arm under the baseline-safety constraint; the proof shows Ω(√T + √D) pseudo-regret and Ω(1) safety regret. Constants are compared to the upper bound immediately after Theorem 5.2, confirming matching up to logs. We will expand the construction details, restate the comparator assumptions explicitly, and add a side-by-side constant comparison table in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives regret bounds for Prudent-Banker via a delay-adapted Online Mirror Descent combined with a phased-aggression mechanism whose restart threshold is constructed from a worst-case analysis of unobserved feedback distortion. The claimed pseudo-regret Õ(√T + √D) and Õ(1) safety regret are obtained by standard potential-function arguments and a separate lower-bound construction; neither quantity is obtained by fitting parameters to the target rates nor by renaming an input quantity. No self-citations are invoked as load-bearing uniqueness theorems, and the derivation relies on externally verifiable martingale and convex-optimization tools rather than internal redefinitions. The analysis is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations, parameter lists, or proof structure visible to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5788 in / 1095 out tokens · 20033 ms · 2026-05-25T05:25:50.117400+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 4 internal anchors

  1. [1]

    Competing in the dark: An efficient algorithm for bandit linear optimization

    Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. InConference on Learning Theory, number 110, 2009

  2. [2]

    Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011

    Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011

  3. [3]

    Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

    Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

  4. [4]

    Gambling in a rigged casino: The adversarial multi-armed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995

  5. [5]

    The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002

    Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002

  6. [6]

    Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018

    Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018

  7. [7]

    Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

    Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

  8. [8]

    What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

    Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018

  9. [9]

    Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019

    Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019

  10. [10]

    No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022

    Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022

  11. [11]

    Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007

    Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007

  12. [12]

    Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019

    Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019

  13. [13]

    Online learning for qoe-based video streaming to mobile receivers

    Nesrine Changuel, Bessem Sayadi, and Michel Kieffer. Online learning for qoe-based video streaming to mobile receivers. In2012 IEEE Globecom Workshops, pages 1319–1324. IEEE, 2012. 10

  14. [14]

    Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020

    Baiming Chen, Mengdi Xu, Zuxin Liu, Liang Li, and Ding Zhao. Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020

  15. [15]

    Black-box reductions for parameter-free online learning in banach spaces

    Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018

  16. [16]

    Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022

    Yan Dai, Haipeng Luo, and Liyu Chen. Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022

  17. [17]

    Safely using predictions in general-sum normal form games

    Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. InProceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017

  18. [18]

    Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J

    Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J. Mach. Learn. Res., 15(1):3873–3923, 2014

  19. [19]

    Efficient Optimal Learning for Contextual Bandits

    Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369, 2011

  20. [20]

    Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

    Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

  21. [21]

    Regret to the best vs

    Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008

  22. [22]

    Online learning with optimism and delay

    Genevieve E Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey. Online learning with optimism and delay. In International Conference on Machine Learning, pages 3363–3373. PMLR, 2021

  23. [23]

    Online convex optimization in the bandit setting: gradient descent without a gradient

    Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex opti- mization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

  24. [24]

    A second-order bound with excess losses

    Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014

  25. [25]

    Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

    Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

  26. [26]

    Conserva- tive exploration in reinforcement learning

    Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conserva- tive exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020

  27. [27]

    Regret bounds for prediction problems

    Geoffrey J Gordon. Regret bounds for prediction problems. InProceedings of the twelfth annual conference on Computational learning theory, pages 29–40, 1999

  28. [28]

    Adapting to delays and data in adversarial multi-armed bandits

    Andras Gyorgy and Pooria Joulani. Adapting to delays and data in adversarial multi-armed bandits. InInternational Conference on Machine Learning, pages 3988–3997. PMLR, 2021

  29. [29]

    Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021

    Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021

  30. [30]

    Robust wireless scheduling under arbitrary channel dynamics and feedback delay

    Jiatai Huang and Longbo Huang. Robust wireless scheduling under arbitrary channel dynamics and feedback delay. In2021 33th International Teletraffic Congress (ITC-33), pages 1–8. IEEE, 2021

  31. [31]

    Banker online mirror descent: A universal approach for delayed online bandit learning

    Jiatai Huang, Yan Dai, and Longbo Huang. Banker online mirror descent: A universal approach for delayed online bandit learning. InInternational Conference on Machine Learning, pages 13814–13844. PMLR, 2023

  32. [32]

    Adaptive online prediction by following the perturbed leader

    Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005. 11

  33. [33]

    Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020

    Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020

  34. [34]

    Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022

    Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022

  35. [35]

    Online learning under delayed feedback

    Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013

  36. [36]

    Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

    Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

  37. [37]

    Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014

    Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014

  38. [38]

    The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

    Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

  39. [39]

    Learning adversarial markov deci- sion processes with delayed feedback

    Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial markov deci- sion processes with delayed feedback. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7281–7289, 2022

  40. [40]

    The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

    Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

  41. [41]

    Cambridge University Press, 2020

    Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020

  42. [42]

    A contextual-bandit approach to personalized news article recommendation

    Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

  43. [43]

    Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

    Mingyang Liu, Chengjie Wu, Qihan Liu, Yansen Jing, Jun Yang, Pingzhong Tang, and Chongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

  44. [44]

    Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021

    Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021

  45. [45]

    Setting up a reinforcement learning task with a real-world robot

    A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4635–4640. IEEE, 2018

  46. [46]

    Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023

    Arnab Maiti, Kevin Jamieson, and Lillian J Ratliff. Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023

  47. [47]

    A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022

    Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022

  48. [48]

    On-line learning with delayed label feedback

    Chris Mesterharm. On-line learning with delayed label feedback. InInternational Conference on Algorithmic Learning Theory, pages 399–413. Springer, 2005

  49. [49]

    Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

    Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and V olkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

  50. [50]

    Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015

    Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015

  51. [51]

    A Modern Introduction to Online Learning

    Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019. 12

  52. [52]

    Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

    Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

  53. [53]

    Just-in-time adaptive anomaly detection and personalized health feedback

    Parvaneh Parvin. Just-in-time adaptive anomaly detection and personalized health feedback. 2020

  54. [54]

    Some aspects of the sequential design of experiments

    Herbert Robbins. Some aspects of the sequential design of experiments. 1952

  55. [55]

    Exploiting easy data in online optimization

    Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27, 2014

  56. [56]

    Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025

    Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025

  57. [57]

    Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019

    Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019

  58. [58]

    Comparator-adaptive convex bandits

    Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits. Advances in Neural Information Processing Systems, 33:19795–19804, 2020

  59. [59]

    Linear bandits with stochastic delayed feedback

    Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. InInternational Confer- ence on Machine Learning, pages 9712–9721. PMLR, 2020

  60. [60]

    Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence

    Manfred K Warmuth. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. InElectronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics, 1998, 1998

  61. [61]

    A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014

    James MS Wason and Lorenzo Trippa. A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014

  62. [62]

    More adaptive algorithms for adversarial bandits

    Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. InConference On Learning Theory, pages 1263–1291. PMLR, 2018

  63. [63]

    Conservative bandits

    Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016

  64. [64]

    Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019

    Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019

  65. [65]

    safe AI

    Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. InInternational Conference on Artificial Intelligence and Statistics, pages 3285–3294. PMLR, 2020. 13 Contents 1 Introduction 1 1.1 How Delay Breaks Safety: Three Roadblocks . . . . . . . . . . . . . . . . . . . . 2 1.2 Our Contributions . . . . . . . . ...

  66. [66]

    In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test

    Eliminating the safety certification gap.We derive adelay-calibrated restart conditionthat accounts for the information deficit induced by delayed feedback. In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test. We show that under delays, this test must be relaxed by an explicit slack term...

  67. [67]

    doubling trick

    Cost-free adaptation to unknown delays.A central challenge is adapting to an unknown total delay D without compromising safety. Standard “doubling trick” approaches [3, 4] typically treat each restart as a fresh instance. While such restarts incur only mild overhead for conventional regret, they can be catastrophic for safety: naively resetting the mechan...

  68. [68]

    arrived” and “pending

    Generalizing Banker-OMD to Safe Learning.Finally, we answer the challenge of extending safety guarantees beyond standard immediate-feedback environments. While Müller et al. [49] established the foundations of the (Safe) online linear minimization framework, the feasibility of such guarantees under partial monitoring remained a question. We bridge this ga...

  69. [69]

    If all feedback arrives by end of phase k(s) (i.e., r+d r ≤end k(s) for all r), then min{dr,end k(s) −r}=d r, andD k(s) endk(s) =P dr =D k(s) endk(s)

  70. [70]

    arrived” feedback (controlled by the phase-gap definition since restart didn’t trigger) and “missing

    If some feedback arrives after endk(s) (i.e., r+d r >end k(s) for some r), then min{dr,end k(s) −r}< d r for thoser, and thusD k(s) endk(s) < D k(s) endk(s) . In all cases,D k(s) endk(s) ≤D (s) endk(s) . 21 C.2 Proof of Lemma C.2 (Bound on Estimated Total Delay) Intuition. This lemma establishes that the updated budget bDs+1 provides a tight upper bound o...

  71. [71]

    For everym∈[M−1], L2 m ≥ X t∈Bm+1 dt. Proof. We borrow the proof idea from Masoudian et al. [47]. For every m, since bm ∈B m, we have bm +d bm > b m+1 −1. Sinceb m,d bm, andb m+1 are integers, it follows that bm +d bm ≥b m+1, and therefore Lm =b m+1 −b m ≤d bm. On the other hand, by construction of the greedy buckets, there existst m ∈B m such that tm +d ...