Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

Chen-Yu Wei; Haipeng Luo

arxiv: 2102.05406 · v3 · pith:P2VR2UKSnew · submitted 2021-02-10 · 💻 cs.LG · cs.AI· stat.ML

Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach

Chen-Yu Wei , Haipeng Luo This is my paper

classification 💻 cs.LG cs.AIstat.ML

keywords optimalalgorithmblack-boxdeltaknowledgeregretalgorithmsapproach

0 comments

read the original abstract

We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment, importantly without any prior knowledge on the degree of non-stationarity. By plugging different algorithms into our black-box, we provide a list of examples showing that our approach not only recovers recent results for (contextual) multi-armed bandits achieved by very specialized algorithms, but also significantly improves the state of the art for (generalized) linear bandits, episodic MDPs, and infinite-horizon MDPs in various ways. Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs
cs.LG 2026-06 unverdicted novelty 7.0

Introduces priced face-crossing via normal-fan geometry on occupancy polytopes to decompose dynamic regret into intrinsic motion cost plus within-face error in non-stationary adversarial MDPs.