Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

Emmanouil-Vasileios Vlatakis-Gkaragkounis; Luanda Cai; Ting Hu

arxiv: 2605.23351 · v1 · pith:WVRCR5ZRnew · submitted 2026-05-22 · 💻 cs.LG · cs.GT

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

Ting Hu , Luanda Cai , Emmanouil-Vasileios Vlatakis-Gkaragkounis This is my paper

Pith reviewed 2026-05-25 05:25 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords adversarial banditsdelayed feedbacksafety-aware regretonline mirror descentphased aggressionregret lower bounds

0 comments

The pith

Prudent-Banker achieves minimax-optimal regret in adversarial bandits with delays while keeping constant regret to a safe baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Prudent-Banker for adversarial multi-armed bandits with and without delayed feedback. It targets a safety-aware goal of minimax-optimal worst-case regret paired with nearly constant regret relative to a designated safe baseline policy. Delays risk mistiming the switch between conservative and exploratory phases in prior methods. The algorithm pairs a delay-adapted Online Mirror Descent with a modified phased-aggression scheme whose restart threshold is calibrated to the maximum possible distortion from missing observations. This yields the claimed pseudo-regret of order square root of T plus square root of D together with constant regret against the safe comparator, and the bounds match new lower bounds established in the work.

Core claim

Prudent-Banker is the first algorithm to achieve the optimal safety-robustness trade-off: pseudo-regret Õ(√T + √D) together with Õ(1) regret against the safe comparator, both with and without delays. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. The paper also proves matching lower bounds showing these guarantees are unimprovable up to logarithmic factors under the baseline-safety requirement.

What carries the argument

The delay-calibrated restart threshold inside the phased-aggression mechanism, which detects when the safe comparator is suboptimal while compensating for the worst-case effect of unobserved delayed feedback.

If this is right

The same regret guarantees apply whether feedback is delayed or immediate.
The bounds are tight up to logarithmic factors under the baseline-safety requirement.
The algorithm balances safety and learning across diverse delay distributions where standard delay-robust baselines do not.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same restart-calibration idea could be tested in other partial-feedback settings where timing of information affects safety constraints.
If the threshold construction generalizes, safety constraints may be incorporable into many existing online algorithms without inflating the leading regret term.

Load-bearing premise

The delay-calibrated restart threshold accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality.

What would settle it

A sequence of delayed adversarial bandit instances in which Prudent-Banker incurs more than constant regret against the safe comparator would show the main guarantee does not hold.

Figures

Figures reproduced from arXiv: 2605.23351 by Emmanouil-Vasileios Vlatakis-Gkaragkounis, Luanda Cai, Ting Hu.

**Figure 1.** Figure 1: Dynamic performance of PRUDENT-BANKER versus delay-aware safety baselines. Top: cumulative pseudo-loss minus that of the hindsight best fixed arm i ⋆ . Middle: PRUDENT-BANKER cumulative loss minus comparator loss Lc (negative values mean PRUDENT-BANKER beats x c ). Bottom: the PRUDENT-BANKER mixing coefficient αt. Under delay, PRUDENT-BANKER repeatedly performs conservative restarts and only re-enters aggr… view at source ↗

**Figure 2.** Figure 2: Dynamic performance against the best-fixed-arm comparator. The top row plots cumulative [PITH_FULL_IMAGE:figures/full_fig_p047_2.png] view at source ↗

read the original abstract

We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety--robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Prudent-Banker claims the first optimal safety-robustness trade-off for adversarial bandits with arbitrary delays via a delay-calibrated restart on phased OMD, backed by lower bounds, but the abstract-only view leaves the key derivation unverified.

read the letter

The main thing is that this paper gives an algorithm that keeps pseudo-regret at Õ(√T + √D) while holding regret to a safe baseline at Õ(1), and it works with or without delays. They also supply matching lower bounds under the safety constraint. That combination looks new based on the abstract's claim that prior work could not preserve the safety guarantee once delays are arbitrary. The technical move is a delay-adapted online mirror descent plus a modified phased-aggression rule whose restart threshold is tuned to the worst-case effect of unobserved feedback. If the derivation holds, it closes a concrete gap that appears in applied settings with safe defaults. The paper does well by stating the rates explicitly and by including the lower-bound argument rather than stopping at an upper bound. The experiments are described only at a high level, but the abstract at least indicates they test across delay distributions and compare against standard delay-robust baselines. The soft spot is that no proofs, pseudocode, or quantitative results are visible, so it is impossible to check whether the restart threshold actually bounds the distortion from delays without introducing extra assumptions or whether the detection of suboptimality is reliable in the worst case. The reader's soundness score of 3.0 reflects exactly this gap. This work is aimed at specialists in adversarial online learning who care about safety constraints and delays. A reader who needs the specific rates would get value if the math checks out. It deserves a serious referee to examine the derivations and the experimental setup in detail. Recommendation: send to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes Prudent-Banker, an algorithm for adversarial multi-armed bandits (with and without delays) that combines a delay-adapted Online Mirror Descent with a modified phased-aggression mechanism. Its central claim is that a delay-calibrated restart threshold enables the optimal safety-robustness trade-off: pseudo-regret Õ(√T + √D) together with Õ(1) regret against a designated safe baseline comparator. The paper also derives new lower bounds showing these rates are unimprovable (up to log factors) under the baseline-safety constraint, and reports experiments across delay distributions.

Significance. If the central claims hold, the work would be significant for online learning: it is presented as the first algorithm to achieve the stated optimal trade-off simultaneously with and without delays, backed by matching lower bounds. The explicit handling of arbitrary delays while preserving the Õ(1) safety guarantee against the baseline addresses a practical gap in existing delay-robust methods. The experiments provide initial evidence of practical balance between safety and learning.

major comments (2)

[Abstract (key technical contribution paragraph)] The central claim rests on the delay-calibrated restart threshold correctly bounding worst-case distortion from unobserved feedback while detecting comparator suboptimality without inflating the Õ(1) safety regret. The abstract states this property but the provided text does not contain the derivation or proof; this is load-bearing and requires explicit verification in §4 or the appendix before the optimality claim can be assessed.
[Lower bounds (referenced in abstract)] The new lower bounds for safety-constrained adversarial delayed bandits are invoked to show unimprovability up to logs, but the visible text gives no construction details, assumptions on the safe comparator, or explicit comparison to the upper bound constants. This must be checked against the upper-bound analysis to confirm the rates truly match.

minor comments (2)

[Experiments] The experimental section should specify the exact delay distributions, number of runs, and baseline implementations to support reproducibility claims.
[Preliminaries] Notation for the restart threshold and the safe comparator should be defined consistently in the main text before first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the two major comments point by point below, clarifying the locations of the relevant proofs and constructions while indicating revisions to improve accessibility.

read point-by-point responses

Referee: [Abstract (key technical contribution paragraph)] The central claim rests on the delay-calibrated restart threshold correctly bounding worst-case distortion from unobserved feedback while detecting comparator suboptimality without inflating the Õ(1) safety regret. The abstract states this property but the provided text does not contain the derivation or proof; this is load-bearing and requires explicit verification in §4 or the appendix before the optimality claim can be assessed.

Authors: The derivation of the delay-calibrated restart threshold, including its bounding of worst-case distortion from unobserved feedback and preservation of the Õ(1) safety regret, appears in Section 4 (Theorems 4.1–4.2 and Lemmas 4.3–4.5). The complete proof is in Appendix B. To make this more immediately verifiable, we will insert an explicit pointer to these results in the abstract and the opening of Section 4 in the revised version. revision: yes
Referee: [Lower bounds (referenced in abstract)] The new lower bounds for safety-constrained adversarial delayed bandits are invoked to show unimprovability up to logs, but the visible text gives no construction details, assumptions on the safe comparator, or explicit comparison to the upper bound constants. This must be checked against the upper-bound analysis to confirm the rates truly match.

Authors: Section 5 presents the lower-bound construction for the safety-constrained setting (two-arm adversarial instance with arbitrary delays). The safe comparator is defined as the best fixed arm under the baseline-safety constraint; the proof shows Ω(√T + √D) pseudo-regret and Ω(1) safety regret. Constants are compared to the upper bound immediately after Theorem 5.2, confirming matching up to logs. We will expand the construction details, restate the comparator assumptions explicitly, and add a side-by-side constant comparison table in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives regret bounds for Prudent-Banker via a delay-adapted Online Mirror Descent combined with a phased-aggression mechanism whose restart threshold is constructed from a worst-case analysis of unobserved feedback distortion. The claimed pseudo-regret Õ(√T + √D) and Õ(1) safety regret are obtained by standard potential-function arguments and a separate lower-bound construction; neither quantity is obtained by fitting parameters to the target rates nor by renaming an input quantity. No self-citations are invoked as load-bearing uniqueness theorems, and the derivation relies on externally verifiable martingale and convex-optimization tools rather than internal redefinitions. The analysis is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations, parameter lists, or proof structure visible to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5788 in / 1095 out tokens · 20033 ms · 2026-05-25T05:25:50.117400+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 4 internal anchors

[1]

Competing in the dark: An efficient algorithm for bandit linear optimization

Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. InConference on Learning Theory, number 110, 2009

work page 2009
[2]

Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011

Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011

work page 2011
[3]

Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

work page 2010
[4]

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995

work page 1995
[5]

The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002

work page 2002
[6]

Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018

work page 2018
[7]

Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

work page 2003
[8]

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019

work page 2019
[10]

No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022

work page 2022
[11]

Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007

Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007

work page 2007
[12]

Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019

Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019

work page 2019
[13]

Online learning for qoe-based video streaming to mobile receivers

Nesrine Changuel, Bessem Sayadi, and Michel Kieffer. Online learning for qoe-based video streaming to mobile receivers. In2012 IEEE Globecom Workshops, pages 1319–1324. IEEE, 2012. 10

work page 2012
[14]

Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020

Baiming Chen, Mengdi Xu, Zuxin Liu, Liang Li, and Ding Zhao. Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020

work page arXiv 2005
[15]

Black-box reductions for parameter-free online learning in banach spaces

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018

work page 2018
[16]

Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022

Yan Dai, Haipeng Luo, and Liyu Chen. Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022

work page 2022
[17]

Safely using predictions in general-sum normal form games

Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. InProceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017

work page 2017
[18]

Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J

Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J. Mach. Learn. Res., 15(1):3873–3923, 2014

work page 2014
[19]

Efficient Optimal Learning for Contextual Bandits

Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011
[20]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

work page arXiv 2003
[21]

Regret to the best vs

Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008

work page 2008
[22]

Online learning with optimism and delay

Genevieve E Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey. Online learning with optimism and delay. In International Conference on Machine Learning, pages 3363–3373. PMLR, 2021

work page 2021
[23]

Online convex optimization in the bandit setting: gradient descent without a gradient

Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex opti- mization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

work page internal anchor Pith review Pith/arXiv arXiv 2004
[24]

A second-order bound with excess losses

Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014

work page 2014
[25]

Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

work page 2015
[26]

Conserva- tive exploration in reinforcement learning

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conserva- tive exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020

work page 2020
[27]

Regret bounds for prediction problems

Geoffrey J Gordon. Regret bounds for prediction problems. InProceedings of the twelfth annual conference on Computational learning theory, pages 29–40, 1999

work page 1999
[28]

Adapting to delays and data in adversarial multi-armed bandits

Andras Gyorgy and Pooria Joulani. Adapting to delays and data in adversarial multi-armed bandits. InInternational Conference on Machine Learning, pages 3988–3997. PMLR, 2021

work page 2021
[29]

Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021

Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021

work page arXiv 2021
[30]

Robust wireless scheduling under arbitrary channel dynamics and feedback delay

Jiatai Huang and Longbo Huang. Robust wireless scheduling under arbitrary channel dynamics and feedback delay. In2021 33th International Teletraffic Congress (ITC-33), pages 1–8. IEEE, 2021

work page 2021
[31]

Banker online mirror descent: A universal approach for delayed online bandit learning

Jiatai Huang, Yan Dai, and Longbo Huang. Banker online mirror descent: A universal approach for delayed online bandit learning. InInternational Conference on Machine Learning, pages 13814–13844. PMLR, 2023

work page 2023
[32]

Adaptive online prediction by following the perturbed leader

Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005. 11

work page 2005
[33]

Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020

Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020

work page 2020
[34]

Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022

Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022

work page 2022
[35]

Online learning under delayed feedback

Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013

work page 2013
[36]

Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

work page 2011
[37]

Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014

Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014

work page 2014
[38]

The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

work page 2013
[39]

Learning adversarial markov deci- sion processes with delayed feedback

Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial markov deci- sion processes with delayed feedback. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7281–7289, 2022

work page 2022
[40]

The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

work page 2015
[41]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020

work page 2020
[42]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

work page 2010
[43]

Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

Mingyang Liu, Chengjie Wu, Qihan Liu, Yansen Jing, Jun Yang, Pingzhong Tang, and Chongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

work page 2022
[44]

Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021

Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021

work page 2021
[45]

Setting up a reinforcement learning task with a real-world robot

A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4635–4640. IEEE, 2018

work page 2018
[46]

Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023

Arnab Maiti, Kevin Jamieson, and Lillian J Ratliff. Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023

work page arXiv 2023
[47]

A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022

Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022

work page 2022
[48]

On-line learning with delayed label feedback

Chris Mesterharm. On-line learning with delayed label feedback. InInternational Conference on Algorithmic Learning Theory, pages 399–413. Springer, 2005

work page 2005
[49]

Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and V olkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

work page arXiv 2025
[50]

Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015

Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015

work page 2015
[51]

A Modern Introduction to Online Learning

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019. 12

work page internal anchor Pith review Pith/arXiv arXiv 1912
[52]

Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

work page 2016
[53]

Just-in-time adaptive anomaly detection and personalized health feedback

Parvaneh Parvin. Just-in-time adaptive anomaly detection and personalized health feedback. 2020

work page 2020
[54]

Some aspects of the sequential design of experiments

Herbert Robbins. Some aspects of the sequential design of experiments. 1952

work page 1952
[55]

Exploiting easy data in online optimization

Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27, 2014

work page 2014
[56]

Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025

work page 2025
[57]

Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019

Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[58]

Comparator-adaptive convex bandits

Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits. Advances in Neural Information Processing Systems, 33:19795–19804, 2020

work page 2020
[59]

Linear bandits with stochastic delayed feedback

Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. InInternational Confer- ence on Machine Learning, pages 9712–9721. PMLR, 2020

work page 2020
[60]

Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence

Manfred K Warmuth. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. InElectronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics, 1998, 1998

work page 1998
[61]

A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014

James MS Wason and Lorenzo Trippa. A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014

work page 2014
[62]

More adaptive algorithms for adversarial bandits

Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. InConference On Learning Theory, pages 1263–1291. PMLR, 2018

work page 2018
[63]

Conservative bandits

Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016

work page 2016
[64]

Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019

Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[65]

safe AI

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. InInternational Conference on Artificial Intelligence and Statistics, pages 3285–3294. PMLR, 2020. 13 Contents 1 Introduction 1 1.1 How Delay Breaks Safety: Three Roadblocks . . . . . . . . . . . . . . . . . . . . 2 1.2 Our Contributions . . . . . . . . ...

work page 2020
[66]

In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test

Eliminating the safety certification gap.We derive adelay-calibrated restart conditionthat accounts for the information deficit induced by delayed feedback. In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test. We show that under delays, this test must be relaxed by an explicit slack term...

work page
[67]

doubling trick

Cost-free adaptation to unknown delays.A central challenge is adapting to an unknown total delay D without compromising safety. Standard “doubling trick” approaches [3, 4] typically treat each restart as a fresh instance. While such restarts incur only mild overhead for conventional regret, they can be catastrophic for safety: naively resetting the mechan...

work page
[68]

arrived” and “pending

Generalizing Banker-OMD to Safe Learning.Finally, we answer the challenge of extending safety guarantees beyond standard immediate-feedback environments. While Müller et al. [49] established the foundations of the (Safe) online linear minimization framework, the feasibility of such guarantees under partial monitoring remained a question. We bridge this ga...

work page
[69]

If all feedback arrives by end of phase k(s) (i.e., r+d r ≤end k(s) for all r), then min{dr,end k(s) −r}=d r, andD k(s) endk(s) =P dr =D k(s) endk(s)

work page
[70]

arrived” feedback (controlled by the phase-gap definition since restart didn’t trigger) and “missing

If some feedback arrives after endk(s) (i.e., r+d r >end k(s) for some r), then min{dr,end k(s) −r}< d r for thoser, and thusD k(s) endk(s) < D k(s) endk(s) . In all cases,D k(s) endk(s) ≤D (s) endk(s) . 21 C.2 Proof of Lemma C.2 (Bound on Estimated Total Delay) Intuition. This lemma establishes that the updated budget bDs+1 provides a tight upper bound o...

work page
[71]

For everym∈[M−1], L2 m ≥ X t∈Bm+1 dt. Proof. We borrow the proof idea from Masoudian et al. [47]. For every m, since bm ∈B m, we have bm +d bm > b m+1 −1. Sinceb m,d bm, andb m+1 are integers, it follows that bm +d bm ≥b m+1, and therefore Lm =b m+1 −b m ≤d bm. On the other hand, by construction of the greedy buckets, there existst m ∈B m such that tm +d ...

work page 2065

[1] [1]

Competing in the dark: An efficient algorithm for bandit linear optimization

Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. InConference on Learning Theory, number 110, 2009

work page 2009

[2] [2]

Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011

Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization.Advances in neural information processing systems, 24, 2011

work page 2011

[3] [3]

Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem.Periodica Mathematica Hungarica, 61(1-2):55–65, 2010

work page 2010

[4] [4]

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. InProceedings of IEEE 36th annual foundations of computer science, pages 322–331. IEEE, 1995

work page 1995

[5] [5]

The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem.SIAM journal on computing, 32(1):48–77, 2002

work page 2002

[6] [6]

Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knap- sacks.Journal of the ACM (JACM), 65(3):1–55, 2018

work page 2018

[7] [7]

Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

work page 2003

[8] [8]

What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.arXiv preprint arXiv:1803.06971, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online exp3 learn- ing in adversarial bandits with delayed feedback.Advances in neural information processing systems, 32, 2019

work page 2019

[10] [10]

No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. No weighted- regret learning in adversarial bandits with delays.Journal of Machine Learning Research, 23 (139):1–43, 2022

work page 2022

[11] [11]

Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007

Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice.Machine Learning, 66(2):321–352, 2007

work page 2007

[12] [12]

Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019

Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in non- stochastic bandits.Journal of Machine Learning Research, 20(17):1–38, 2019

work page 2019

[13] [13]

Online learning for qoe-based video streaming to mobile receivers

Nesrine Changuel, Bessem Sayadi, and Michel Kieffer. Online learning for qoe-based video streaming to mobile receivers. In2012 IEEE Globecom Workshops, pages 1319–1324. IEEE, 2012. 10

work page 2012

[14] [14]

Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020

Baiming Chen, Mengdi Xu, Zuxin Liu, Liang Li, and Ding Zhao. Delay-aware multi- agent reinforcement learning for cooperative and competitive environments.arXiv preprint arXiv:2005.05441, 2020

work page arXiv 2005

[15] [15]

Black-box reductions for parameter-free online learning in banach spaces

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018

work page 2018

[16] [16]

Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022

Yan Dai, Haipeng Luo, and Liyu Chen. Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback.Advances in Neural Information Processing Systems, 35:11437–11449, 2022

work page 2022

[17] [17]

Safely using predictions in general-sum normal form games

Steven Damer and Maria Gini. Safely using predictions in general-sum normal form games. InProceedings of the 16th conference on autonomous agents and multiagent systems, pages 924–932, 2017

work page 2017

[18] [18]

Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J

Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.J. Mach. Learn. Res., 15(1):3873–3923, 2014

work page 2014

[19] [19]

Efficient Optimal Learning for Contextual Bandits

Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits.arXiv preprint arXiv:1106.2369, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011

[20] [20]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

work page arXiv 2003

[21] [21]

Regret to the best vs

Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average.Machine Learning, 72(1):21–37, 2008

work page 2008

[22] [22]

Online learning with optimism and delay

Genevieve E Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey. Online learning with optimism and delay. In International Conference on Machine Learning, pages 3363–3373. PMLR, 2021

work page 2021

[23] [23]

Online convex optimization in the bandit setting: gradient descent without a gradient

Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex opti- mization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

work page internal anchor Pith review Pith/arXiv arXiv 2004

[24] [24]

A second-order bound with excess losses

Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. PMLR, 2014

work page 2014

[25] [25]

Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation.ACM Transactions on Economics and Computation (TEAC), 3(2):1–28, 2015

work page 2015

[26] [26]

Conserva- tive exploration in reinforcement learning

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Conserva- tive exploration in reinforcement learning. InInternational conference on artificial intelligence and statistics, pages 1431–1441. PMLR, 2020

work page 2020

[27] [27]

Regret bounds for prediction problems

Geoffrey J Gordon. Regret bounds for prediction problems. InProceedings of the twelfth annual conference on Computational learning theory, pages 29–40, 1999

work page 1999

[28] [28]

Adapting to delays and data in adversarial multi-armed bandits

Andras Gyorgy and Pooria Joulani. Adapting to delays and data in adversarial multi-armed bandits. InInternational Conference on Machine Learning, pages 3988–3997. PMLR, 2021

work page 2021

[29] [29]

Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021

Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Delayed feedback in episodic rein- forcement learning.arXiv preprint arXiv:2111.07615, 2021

work page arXiv 2021

[30] [30]

Robust wireless scheduling under arbitrary channel dynamics and feedback delay

Jiatai Huang and Longbo Huang. Robust wireless scheduling under arbitrary channel dynamics and feedback delay. In2021 33th International Teletraffic Congress (ITC-33), pages 1–8. IEEE, 2021

work page 2021

[31] [31]

Banker online mirror descent: A universal approach for delayed online bandit learning

Jiatai Huang, Yan Dai, and Longbo Huang. Banker online mirror descent: A universal approach for delayed online bandit learning. InInternational Conference on Machine Learning, pages 13814–13844. PMLR, 2023

work page 2023

[32] [32]

Adaptive online prediction by following the perturbed leader

Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. 2005. 11

work page 2005

[33] [33]

Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020

Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits.Advances in Neural Information Processing Systems, 33:4872–4883, 2020

work page 2020

[34] [34]

Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022

Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near- optimal regret for adversarial mdp with delayed bandit feedback.Advances in Neural Informa- tion Processing Systems, 35:33469–33481, 2022

work page 2022

[35] [35]

Online learning under delayed feedback

Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. InInternational conference on machine learning, pages 1453–1461. PMLR, 2013

work page 2013

[36] [36]

Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

Michael Kapralov and Rina Panigrahy. Prediction strategies without loss.Advances in Neural Information Processing Systems, 24, 2011

work page 2011

[37] [37]

Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014

Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit ex- ploration in bandit problems with side observations.Advances in Neural Information Processing Systems, 27, 2014

work page 2014

[38] [38]

The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

Wouter M Koolen. The pareto regret frontier.Advances in Neural Information Processing Systems, 26, 2013

work page 2013

[39] [39]

Learning adversarial markov deci- sion processes with delayed feedback

Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial markov deci- sion processes with delayed feedback. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7281–7289, 2022

work page 2022

[40] [40]

The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

Tor Lattimore. The pareto regret frontier for bandits.Advances in Neural Information Processing Systems, 28, 2015

work page 2015

[41] [41]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020

work page 2020

[42] [42]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

work page 2010

[43] [43]

Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

Mingyang Liu, Chengjie Wu, Qihan Liu, Yansen Jing, Jun Yang, Pingzhong Tang, and Chongjie Zhang. Safe opponent-exploitation subgame refinement.Advances in Neural Information Processing Systems, 35:27610–27622, 2022

work page 2022

[44] [44]

Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021

Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps.Advances in Neural Information Processing Systems, 34:17183–17193, 2021

work page 2021

[45] [45]

Setting up a reinforcement learning task with a real-world robot

A Rupam Mahmood, Dmytro Korenkevych, Brent J Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot. In2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4635–4640. IEEE, 2018

work page 2018

[46] [46]

Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023

Arnab Maiti, Kevin Jamieson, and Lillian J Ratliff. Logarithmic regret for matrix games against an adversary with noisy bandit feedback.arXiv preprint arXiv:2306.13233, 2023

work page arXiv 2023

[47] [47]

A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022

Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback.Advances in Neural Information Processing Systems, 35: 11752–11762, 2022

work page 2022

[48] [48]

On-line learning with delayed label feedback

Chris Mesterharm. On-line learning with delayed label feedback. InInternational Conference on Algorithmic Learning Theory, pages 399–413. Springer, 2005

work page 2005

[49] [49]

Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

Adrian Müller, Jon Schneider, Stratis Skoulakis, Luca Viano, and V olkan Cevher. Best of both worlds: Regret minimization versus minimax play.arXiv preprint arXiv:2502.11673, 2025

work page arXiv 2025

[50] [50]

Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015

Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits.Advances in Neural Information Processing Systems, 28, 2015

work page 2015

[51] [51]

A Modern Introduction to Online Learning

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019. 12

work page internal anchor Pith review Pith/arXiv arXiv 1912

[52] [52]

Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

work page 2016

[53] [53]

Just-in-time adaptive anomaly detection and personalized health feedback

Parvaneh Parvin. Just-in-time adaptive anomaly detection and personalized health feedback. 2020

work page 2020

[54] [54]

Some aspects of the sequential design of experiments

Herbert Robbins. Some aspects of the sequential design of experiments. 1952

work page 1952

[55] [55]

Exploiting easy data in online optimization

Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27, 2014

work page 2014

[56] [56]

Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends® in Machine Learning, 4(2):107–194, 2025

work page 2025

[57] [57]

Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019

Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays.Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[58] [58]

Comparator-adaptive convex bandits

Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. Comparator-adaptive convex bandits. Advances in Neural Information Processing Systems, 33:19795–19804, 2020

work page 2020

[59] [59]

Linear bandits with stochastic delayed feedback

Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. InInternational Confer- ence on Machine Learning, pages 9712–9721. PMLR, 2020

work page 2020

[60] [60]

Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence

Manfred K Warmuth. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. InElectronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics, 1998, 1998

work page 1998

[61] [61]

A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014

James MS Wason and Lorenzo Trippa. A comparison of bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials.Statistics in medicine, 33(13):2206–2221, 2014

work page 2014

[62] [62]

More adaptive algorithms for adversarial bandits

Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. InConference On Learning Theory, pages 1263–1291. PMLR, 2018

work page 2018

[63] [63]

Conservative bandits

Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262. PMLR, 2016

work page 2016

[64] [64]

Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019

Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays.Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[65] [65]

safe AI

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. InInternational Conference on Artificial Intelligence and Statistics, pages 3285–3294. PMLR, 2020. 13 Contents 1 Introduction 1 1.1 How Delay Breaks Safety: Three Roadblocks . . . . . . . . . . . . . . . . . . . . 2 1.2 Our Contributions . . . . . . . . ...

work page 2020

[66] [66]

In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test

Eliminating the safety certification gap.We derive adelay-calibrated restart conditionthat accounts for the information deficit induced by delayed feedback. In standard bandits, verifying that the baseline/comparator is suboptimal reduces to a relatively direct hypothesis test. We show that under delays, this test must be relaxed by an explicit slack term...

work page

[67] [67]

doubling trick

Cost-free adaptation to unknown delays.A central challenge is adapting to an unknown total delay D without compromising safety. Standard “doubling trick” approaches [3, 4] typically treat each restart as a fresh instance. While such restarts incur only mild overhead for conventional regret, they can be catastrophic for safety: naively resetting the mechan...

work page

[68] [68]

arrived” and “pending

Generalizing Banker-OMD to Safe Learning.Finally, we answer the challenge of extending safety guarantees beyond standard immediate-feedback environments. While Müller et al. [49] established the foundations of the (Safe) online linear minimization framework, the feasibility of such guarantees under partial monitoring remained a question. We bridge this ga...

work page

[69] [69]

If all feedback arrives by end of phase k(s) (i.e., r+d r ≤end k(s) for all r), then min{dr,end k(s) −r}=d r, andD k(s) endk(s) =P dr =D k(s) endk(s)

work page

[70] [70]

arrived” feedback (controlled by the phase-gap definition since restart didn’t trigger) and “missing

If some feedback arrives after endk(s) (i.e., r+d r >end k(s) for some r), then min{dr,end k(s) −r}< d r for thoser, and thusD k(s) endk(s) < D k(s) endk(s) . In all cases,D k(s) endk(s) ≤D (s) endk(s) . 21 C.2 Proof of Lemma C.2 (Bound on Estimated Total Delay) Intuition. This lemma establishes that the updated budget bDs+1 provides a tight upper bound o...

work page

[71] [71]

For everym∈[M−1], L2 m ≥ X t∈Bm+1 dt. Proof. We borrow the proof idea from Masoudian et al. [47]. For every m, since bm ∈B m, we have bm +d bm > b m+1 −1. Sinceb m,d bm, andb m+1 are integers, it follows that bm +d bm ≥b m+1, and therefore Lm =b m+1 −b m ≤d bm. On the other hand, by construction of the greedy buckets, there existst m ∈B m such that tm +d ...

work page 2065