Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning

Emma Brunskill; Jay Whang; Patrick Cho; Ramtin Keramati

arxiv: 1806.00175 · v2 · pith:K6ZCRHHYnew · submitted 2018-06-01 · 💻 cs.AI

Ramtin Keramati, Jay Whang, Patrick Cho, Emma Brunskill

classification 💻 cs.AI

keywords learning, exploration, learn, models, planning, reinforcement, strategic, algorithms

Humans learn to play video games significantly faster than state-of-the-art reinforcement learning (RL) algorithms. People seem to build simple, easily learned models that support planning and strategic exploration. Inspired by this, we investigate two issues in leveraging model-based RL for sample efficiency. First, we investigate how to perform strategic exploration when exact planning is not feasible, and we show empirically that optimistic Monte Carlo Tree Search outperforms posterior sampling methods. Second, we show how to learn simple deterministic models that support fast learning by using object representations. We illustrate the benefit of these ideas by introducing a novel algorithm, Strategic Object Oriented Reinforcement Learning (SOORL), that outperforms state-of-the-art algorithms in the game of Pitfall! in less than 50 episodes.
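
To make the idea of optimism in tree search concrete, here is a minimal sketch of a UCB1-style action selection rule, the kind of optimistic bonus commonly used inside Monte Carlo Tree Search. It is a generic illustration and not the SOORL algorithm: the function name `ucb_select`, the exploration constant `c`, and its default value are assumptions introduced for this example and do not come from the paper.

```python
import math


def ucb_select(counts, values, c=1.4):
    """Return the index of the action with the highest optimistic score.

    counts[a]: number of times action a was tried at this node.
    values[a]: mean return observed so far for action a.
    c: exploration constant (hypothetical default, not from the paper).
    """
    total = sum(counts)
    best_action, best_score = None, -math.inf
    for a, (n, q) in enumerate(zip(counts, values)):
        if n == 0:
            # Optimism under uncertainty: an untried action is taken first.
            return a
        # Mean return plus a bonus that shrinks as the action is tried more.
        score = q + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

Under this rule a rarely tried action can be selected even when its observed mean return is lower, because its uncertainty bonus is larger. That is the sense in which the planner is optimistic.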

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Is Conditional Generative Modeling all you need for Decision-Making?
cs.LG 2022-11 unverdicted novelty 6.0

Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.