pith. machine review for the scientific record. sign in

arxiv: 1206.6400 · v1 · submitted 2012-06-27 · 💻 cs.LG · stat.ML

Recognition: unknown

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Authors on Pith no claims yet
classification 💻 cs.LG stat.ML
keywords regretalgorithmonlineadversarybanditpolicyadaptivesublinear
0
0 comments X
read the original abstract

Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely-accepted formal definition of an online algorithm's ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm's actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm's performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary's memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mechanism Design for Decentralized Risk Detection: Strict Propriety, Network Coalitions, and the Backfiring Mandat

    cs.GT 2026-04 unverdicted novelty 7.0

    TVA implements truthful Bayes-Nash reporting in large federations and shows competitive pressure amplifies moral hazard such that forced sharing without incentives can lower welfare below autarky.

  2. Mechanism Design for Decentralized Risk Detection: Strict Propriety, Network Coalitions, and the Backfiring Mandat

    cs.GT 2026-04 unverdicted novelty 7.0

    TVA mechanism implements truthful posterior reporting as Bayes-Nash equilibrium in decentralized risk detection, with network Shapley coalition analysis and welfare ordering showing backfiring mandates under certain c...

  3. Learning the Preferences of a Learning Agent

    cs.AI 2026-05 unverdicted novelty 6.0

    Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.

  4. Mechanism Design for Decentralized Risk Detection: Strict Propriety, Network Coalitions, and the Backfiring Mandat

    cs.GT 2026-04 unverdicted novelty 6.0

    A dynamic mechanism design framework with strictly proper scoring rules implements truthful reporting in decentralized risk detection among competing firms and identifies conditions for backfiring regulatory mandates.