Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Raman Arora (TTIC) , Ofer Dekel (Microsoft Research) , Ambuj Tewari (University of Texas)

Authors on Pith no claims yet

classification 💻 cs.LG stat.ML

keywords regretalgorithmonlineadversarybanditpolicyadaptivesublinear

read the original abstract

Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely-accepted formal definition of an online algorithm's ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm's actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm's performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary's memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mechanism Design for Decentralized Risk Detection: Strict Propriety, Network Coalitions, and the Backfiring Mandat
cs.GT 2026-04 unverdicted novelty 7.0

TVA implements truthful Bayes-Nash reporting in large federations and shows competitive pressure amplifies moral hazard such that forced sharing without incentives can lower welfare below autarky.
Mechanism Design for Decentralized Risk Detection: Strict Propriety, Network Coalitions, and the Backfiring Mandat
cs.GT 2026-04 unverdicted novelty 7.0

TVA mechanism implements truthful posterior reporting as Bayes-Nash equilibrium in decentralized risk detection, with network Shapley coalition analysis and welfare ordering showing backfiring mandates under certain c...
Learning the Preferences of a Learning Agent
cs.AI 2026-05 unverdicted novelty 6.0

Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
Mechanism Design for Decentralized Risk Detection: Strict Propriety, Network Coalitions, and the Backfiring Mandat
cs.GT 2026-04 unverdicted novelty 6.0

A dynamic mechanism design framework with strictly proper scoring rules implements truthful reporting in decentralized risk detection among competing firms and identifies conditions for backfiring regulatory mandates.