Exploration-Exploitation in Constrained MDPs
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 3
citation-polarity summary
polarities: background 3
representative citing papers
Prudent-Banker achieves pseudo-regret Õ(√T + √D) and Õ(1) regret vs. safe comparator in adversarial bandits both with and without delays, matching new lower bounds up to logs.
A new primal-dual algorithm for adversarial linear CMDPs achieves the first sublinear regret and constraint violation bounds, of order K^{3/4}, using weighted LogSumExp softmax policies with periodic mixing and regularized dual updates.
An algorithm for online resource allocation with budget and general constraints achieves O(√T) regret in stochastic regimes and α-regret in adversarial regimes, with bounded constraint violations.
The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Presents the first algorithm to identify an ε-optimal policy in robust constrained MDPs via epigraph form and bisection search with Õ(ε^{-4}) robust policy evaluations.
PDR-ANPG achieves last-iterate ε-optimality gap and ε constraint violation in CMDPs with sample complexity Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}) for parameterized policies with transferred compatibility error ε_bias.
citing papers explorer
- Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time
  A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
- Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses
  A new primal-dual algorithm for adversarial linear CMDPs achieves the first sublinear regret and constraint violation bounds, of order K^{3/4}, using weighted LogSumExp softmax policies with periodic mixing and regularized dual updates.
- Online Resource Allocation With General Constraints
  An algorithm for online resource allocation with budget and general constraints achieves O(√T) regret in stochastic regimes and α-regret in adversarial regimes, with bounded constraint violations.