pith. sign in

arxiv: 1611.03907 · v4 · pith:MQUPQDADnew · submitted 2016-11-11 · 💻 cs.AI · cs.LG· stat.ML

Reinforcement Learning in Rich-Observation MDPs using Spectral Methods

classification 💻 cs.AI cs.LGstat.ML
keywords algorithmhiddenmappingstatemdpsregretspacespectral
0
0 comments X
read the original abstract

Reinforcement learning (RL) in Markov decision processes (MDPs) with large state spaces is a challenging problem. The performance of standard RL algorithms degrades drastically with the dimensionality of state space. However, in practice, these large MDPs typically incorporate a latent or hidden low-dimensional structure. In this paper, we study the setting of rich-observation Markov decision processes (ROMDP), where there are a small number of hidden states which possess an injective mapping to the observation states. In other words, every observation state is generated through a single hidden state, and this mapping is unknown a priori. We introduce a spectral decomposition method that consistently learns this mapping, and more importantly, achieves it with low regret. The estimated mapping is integrated into an optimistic RL algorithm (UCRL), which operates on the estimated hidden space. We derive finite-time regret bounds for our algorithm with a weak dependence on the dimensionality of the observed space. In fact, our algorithm asymptotically achieves the same average regret as the oracle UCRL algorithm, which has the knowledge of the mapping from hidden to observed spaces. Thus, we derive an efficient spectral RL algorithm for ROMDPs.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Addressing Finite-Horizon MDPs via Low-Rank Tensor Value Approximation

    cs.LG 2025-01 unverdicted novelty 6.0

    Low-rank tensor approximations of value functions enable tractable policy iteration for finite-horizon MDPs, with BCD/BCGD algorithms, convergence guarantees, and error bounds linking evaluation to policy improvement.