With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
SIAM journal on computing , volume=
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6representative citing papers
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Learnability of adversarial noisy bandits is characterized by the convexified generalized maximin volume for oblivious adversaries and for adaptive adversaries when the arm space is countable.
A restarting-based nonparametric online learning method for dynamic pricing with one-point revenue feedback that achieves regret bounds scaling with time horizon and total market variation.
citing papers explorer
-
Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.
-
Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
-
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
-
On Characterizing Learnability for Adversarial Noisy Bandits
Learnability of adversarial noisy bandits is characterized by the convexified generalized maximin volume for oblivious adversaries and for adaptive adversaries when the arm space is countable.
-
Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity
A restarting-based nonparametric online learning method for dynamic pricing with one-point revenue feedback that achieves regret bounds scaling with time horizon and total market variation.
- Online Learning-to-Defer with Varying Experts