pith. machine review for the scientific record. sign in

arxiv: 1806.01265 · v2 · submitted 2018-06-01 · 💻 cs.LG · cs.AI· stat.ML

Recognition: unknown

Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning

Authors on Pith no claims yet
classification 💻 cs.LG cs.AIstat.ML
keywords learningmodelfunctionlossmodel-basedreinforcementvalue-awarewasserstein
0
0 comments X
read the original abstract

Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning as varying the loss function can have a remarkable impact on effectiveness of planning. Recently Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric. This equivalence improves our understanding of value-aware models, and also creates a theoretical foundation for applications of Wasserstein in model-based reinforcement~learning.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 6.0

    The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...

  2. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.