A robust variant of binary search achieves regret O(C + log T) for dynamic pricing when the corruption C is known and O(C + log² T) when it is unknown.
Exploration-Exploitation in Constrained MDPs
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 3
citation-polarity summary: background 3
representative citing papers
Prudent-Banker achieves pseudo-regret Õ(√T + √D) and Õ(1) regret against a safe comparator in adversarial bandits, both with and without delays, matching new lower bounds up to logarithmic factors.
A new primal-dual algorithm for adversarial linear CMDPs achieves the first sublinear regret and constraint violation bounds of order K^{3/4} using weighted LogSumExp softmax policies with periodic mixing and regularized dual updates.
An algorithm for online resource allocation with budget and general constraints achieves O(√T) regret in stochastic and α-regret in adversarial regimes with bounded constraint violations.
The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Presents the first algorithm to identify an ε-optimal policy in robust constrained MDPs via epigraph form and bisection search with Õ(ε^{-4}) robust policy evaluations.
A reparametrization reduces fair zero-sum games under bandit feedback to standard games on a transformed matrix, enabling an Õ(T^{2/3}) regret bound for learning general mixed fair equilibria via an Explore-Then-Commit algorithm.
PDR-ANPG achieves last-iterate ε-optimality gap and ε constraint violation in CMDPs with sample complexity Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}) for parameterized policies with transferred compatibility error ε_bias.