What doubling tricks can and can't do for multi-armed bandits · 2018 · stat.ML · arXiv 1803.06971

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon $T$ of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the "Doubling Trick". In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (minimax) bounds in $R_T = O(\sqrt{T})$ but cannot conserve (distribution-dependent) bounds in $R_T = O(\log T)$. We give insights as to why exponential doubling tricks may be better, as they conserve bounds in $R_T = O(\log T)$, and are close to conserving bounds in $R_T = O(\sqrt{T})$.
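The restart scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the horizon parameterizations (`t0 * b**i` for the geometric family, `base ** (b**i)` for the exponential family), the toy `FixedHorizonUCB` policy, and the choice to give each restart the length of its own phase as horizon are all assumptions made for this example.

```python
import math
import random
from itertools import islice
from typing import Callable, Iterator, List, Tuple


def geometric_horizons(t0: int = 2, b: int = 2) -> Iterator[int]:
    """Yield T_i = t0 * b**i (assumed form of a geometric sequence)."""
    i = 0
    while True:
        yield t0 * b ** i
        i += 1


def exponential_horizons(base: int = 2, b: int = 2) -> Iterator[int]:
    """Yield T_i = base ** (b**i) (assumed form of an exponential sequence)."""
    i = 0
    while True:
        yield base ** (b ** i)
        i += 1


class FixedHorizonUCB:
    """Hypothetical non-anytime policy: its bonus uses log(horizon), not log(t)."""

    def __init__(self, n_arms: int, horizon: int) -> None:
        self.horizon = horizon
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self) -> int:
        # Play every arm once before using the index.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        bonus = 2.0 * math.log(self.horizon)
        index = [s / n + math.sqrt(bonus / n)
                 for s, n in zip(self.sums, self.counts)]
        return max(range(len(index)), key=index.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.sums[arm] += reward


def run_doubling_trick(
    make_algo: Callable[[int], FixedHorizonUCB],
    horizons: Iterator[int],
    pull: Callable[[int], float],
    total_steps: int,
) -> Tuple[List[int], int]:
    """Restart a fresh fixed-horizon algorithm each time a horizon T_i is reached.

    total_steps is used only to stop the simulation; the policy never sees it.
    Returns the list of arms played and the number of restarts.
    """
    arms: List[int] = []
    restarts, t, prev_end = 0, 0, 0
    for end in horizons:
        if t >= total_steps:
            break
        algo = make_algo(end - prev_end)  # horizon = length of this phase
        restarts += 1
        while t < min(end, total_steps):
            arm = algo.select()
            algo.update(arm, pull(arm))
            arms.append(arm)
            t += 1
        prev_end = end
    return arms, restarts


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.4, 0.6]  # illustrative Bernoulli arms
    pull = lambda a: 1.0 if rng.random() < means[a] else 0.0
    print(list(islice(geometric_horizons(), 5)))
    print(list(islice(exponential_horizons(), 4)))
    for name, seq in [("geometric", geometric_horizons()),
                      ("exponential", exponential_horizons())]:
        arms, restarts = run_doubling_trick(
            lambda T: FixedHorizonUCB(len(means), T), seq, pull, 1000)
        print(name, "restarts:", restarts, "pulls of arm 1:", arms.count(1))
```

Under these assumed sequences, 1000 steps trigger 10 restarts with the geometric horizons and 5 with the exponential ones. Each restart discards all past observations, which gives a rough feel for why the growth rate of the sequence matters for the regret bounds discussed in the abstract.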

representative citing papers

Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits

stat.ML · 2026-01-31 · unverdicted · novelty 8.0

First shuffle-DP and joint-DP algorithms for GLM contextual bandits achieve near non-private regret without strong spectral assumptions on contexts.

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Prudent-Banker achieves pseudo-regret Õ(√T + √D) and Õ(1) regret vs. safe comparator in adversarial bandits both with and without delays, matching new lower bounds up to logs.

Online Market Making and the Value of Observing the Order Book

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Introduces action-dependent order-book feedback for online market making, yielding O(sqrt(T)) high-probability regret in stochastic i.i.d. and mean-reverting settings without smoothness assumptions, and O(T^{2/3}) in the adversarial case.

Learning Safely Without Knowing the World: COMPASS-Hedge

cs.LG · 2026-03-22 · unverdicted · novelty 7.0

COMPASS-Hedge is presented as the first parameter-free full-information anytime algorithm that simultaneously delivers minimax-optimal adversarial regret, instance-optimal stochastic regret, and Õ(1) regret to a baseline policy.

Constrained Contextual Bandits with Adversarial Contexts

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

A modular reduction from budget-constrained contextual bandits with adversarial contexts to unconstrained bandits via surrogate rewards, yielding improved guarantees and an efficient algorithm based on SquareCB.

Optimal Semiparametric Dynamic Pricing with Feature Diversity

stat.ME · 2026-05-05 · unverdicted · novelty 7.0

A stagewise greedy algorithm for semiparametric contextual dynamic pricing achieves regret $T^{\max\{1/2,\, 3/(2\beta+1)\}}$ for linear $m$, with a matching lower bound proving optimality.

A single algorithm for both restless and rested rotting bandits

stat.ML · 2026-04-23 · unverdicted · novelty 7.0

RAW-UCB achieves near-optimal regret in both rested and restless rotting bandits without prior knowledge of the setting or non-stationarity type.

Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

A projection-based algorithm for COCO achieves O(log T) regret and O(log T) CCV for strongly convex losses and O(sqrt(T)) for convex losses by leveraging self-contracted curves.

citing papers explorer

Showing 8 of 8 citing papers.

Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits · stat.ML · 2026-01-31 · unverdicted · none · ref 1 · internal anchor
Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays · cs.LG · 2026-05-22 · unverdicted · none · ref 8 · internal anchor
Online Market Making and the Value of Observing the Order Book · cs.LG · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
Learning Safely Without Knowing the World: COMPASS-Hedge · cs.LG · 2026-03-22 · unverdicted · none · ref 10 · internal anchor
Constrained Contextual Bandits with Adversarial Contexts · cs.LG · 2026-05-07 · unverdicted · none · ref 278
Optimal Semiparametric Dynamic Pricing with Feature Diversity · stat.ME · 2026-05-05 · unverdicted · none · ref 76
A single algorithm for both restless and rested rotting bandits · stat.ML · 2026-04-23 · unverdicted · none · ref 8
Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction · cs.LG · 2026-05-20 · unverdicted · none · ref 288 · internal anchor
