OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning

David Meger; Doina Precup; Joelle Pineau; Peter Henderson; Pierre-Luc Bacon; Wei-Di Chang

arxiv: 1709.06683 · v2 · pith:BZKIAICTnew · submitted 2017-09-20 · 💻 cs.LG

Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, Doina Precup

classification 💻 cs.LG

keywords learning, options, reinforcement, reward, inverse, adversarial, complex, demonstrations

Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Intention-Conditioned Flow Occupancy Models
cs.LG 2025-06 unverdicted novelty 5.0

InFOM applies flow matching to model intention-conditioned occupancy measures for RL pre-training, reporting 1.8x median return gains and 36% higher success rates on benchmarks.