Modelling transition dynamics in MDPs with RKHS embeddings

Arthur Gretton (MPI for Intelligent Systems); Guy Lever (University College London); Luca Baldassarre (University College London); Massi Pontil (University College London); Steffen Grunewalder (University College London)

arxiv: 1206.4655 · v1 · pith:RQJMWSYCnew · submitted 2012-06-18 · 💻 cs.LG

classification 💻 cs.LG

keywords policy · RKHS · value · approach · estimation · MDPs · transition · compare

We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate learning tasks in a typical classical control setting (the under-actuated pendulum) and in a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
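
To make the claim about expectations as inner products concrete, here is a minimal sketch of how a conditional expectation can be estimated from sampled transitions without ever estimating a transition density. It uses the standard regularised conditional mean embedding estimator from the kernel embedding literature, which may differ in detail from the estimator used in the paper. The function names, the Gaussian kernel, the bandwidth, the regulariser and the toy dynamics are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gram matrix with entries exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def embedding_weights(Z_train, Z_query, reg=1e-4, bandwidth=0.5):
    """Weights alpha(z) = (K + n * reg * I)^{-1} k(z), one column per query.

    Z_train holds the observed (state, action) pairs, one per row.
    """
    n = Z_train.shape[0]
    K = gaussian_kernel(Z_train, Z_train, bandwidth)
    k_query = gaussian_kernel(Z_train, Z_query, bandwidth)
    return np.linalg.solve(K + reg * n * np.eye(n), k_query)

def expected_value(alpha, f_next):
    """Estimate E[f(s') | s, a] as a weighted sum over the sampled next states.

    The cost is linear in the number of sample points per query.
    """
    return alpha.T @ f_next

# Toy transitions (assumed for illustration): s' = 0.9 s + 0.1 a + small noise.
rng = np.random.default_rng(0)
n = 200
Z_train = rng.uniform(-1.0, 1.0, size=(n, 2))  # columns: state, action
S_next = 0.9 * Z_train[:, 0] + 0.1 * Z_train[:, 1] + 0.01 * rng.standard_normal(n)

f_next = np.sin(S_next)            # any function evaluated at the sampled s'
Z_query = np.array([[0.2, -0.3]])  # query pair (s, a)
alpha = embedding_weights(Z_train, Z_query)
estimate = expected_value(alpha, f_next)[0]
reference = np.sin(0.9 * 0.2 + 0.1 * -0.3)  # approximate ground truth
print(estimate, reference)
```

Once the weights for a query pair are computed, evaluating the expectation of a different function (for example an updated value function inside value iteration) only requires a new weighted sum over the same sampled next states.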

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization
cs.LG 2026-06 unverdicted novelty 6.0

ROVER pretrains transferable exploration policies by maximizing occupancy coverage with a learned resolvent world model and virtual sink state, outperforming baselines on sparse navigation tasks.