Reward Shaping via Meta-Learning

Dong Yan; Hang Su; Haosheng Zou; Jun Zhu; Tongzheng Ren

arxiv: 1901.09330 · v1 · pith:R6U7LZPQnew · submitted 2019-01-27 · 💻 cs.LG · stat.ML


Haosheng Zou, Tongzheng Ren, Dong Yan, Hang Su, Jun Zhu

classification 💻 cs.LG stat.ML

keywords: shaping, reward, tasks, meta-learning, assignment, credit, effective, learning



Reward shaping is one of the most effective methods to tackle the crucial yet challenging problem of credit assignment in Reinforcement Learning (RL). However, designing shaping functions usually requires much expert knowledge and hand-engineering, and the difficulties are further exacerbated given multiple similar tasks to solve. In this paper, we consider reward shaping on a distribution of tasks, and propose a general meta-learning framework to automatically learn the efficient reward shaping on newly sampled tasks, assuming only shared state space but not necessarily action space. We first derive the theoretically optimal reward shaping in terms of credit assignment in model-free RL. We then propose a value-based meta-learning algorithm to extract an effective prior over the optimal reward shaping. The prior can be applied directly to new tasks, or provably adapted to the task-posterior while solving the task within few gradient updates. We demonstrate the effectiveness of our shaping through significantly improved learning efficiency and interpretable visualizations across various settings, including notably a successful transfer from DQN to DDPG.
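
The abstract does not state the form of the shaping function. As an illustration only, the sketch below uses classic potential-based shaping, where a term gamma * phi(s') - phi(s) is added to the environment reward. This is a standard formulation and may differ from the paper's exact method. The 1-D chain task, `GOAL`, `distance_potential`, `shaped_reward`, and `total_shaping` are hypothetical names introduced here.

```python
from typing import Callable, Sequence

GOAL = 5  # assumed goal position on a hypothetical 1-D chain of integer states


def distance_potential(state: int) -> float:
    """Hand-designed potential (an assumption): negative distance to the goal."""
    return -float(abs(GOAL - state))


def shaped_reward(
    reward: float,
    state: int,
    next_state: int,
    potential: Callable[[int], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Return reward + gamma * phi(next_state) - phi(state).

    The potential of a terminal next state is treated as zero.
    """
    next_phi = 0.0 if done else potential(next_state)
    return reward + gamma * next_phi - potential(state)


def total_shaping(
    trajectory: Sequence[int],
    potential: Callable[[int], float],
    gamma: float = 1.0,
) -> float:
    """Sum the shaping terms along a trajectory with zero environment reward."""
    return sum(
        shaped_reward(0.0, s, s_next, potential, gamma)
        for s, s_next in zip(trajectory[:-1], trajectory[1:])
    )
```

With gamma = 1, a step toward the goal earns a positive bonus and a step away earns a matching penalty, so the shaping terms along any trajectory telescope to phi(last state) - phi(first state). Designing such a potential by hand is the kind of expert effort the abstract says the meta-learned prior is meant to replace.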



Forward citations

Cited by 1 Pith paper


On Reward-Balancing Methods for Reinforcement Learning
math.OC 2026-04 unverdicted novelty 6.0

Reward-balancing methods normalize RL reward functions to enable greedy optimal policies, reformulated as optimal control with stochastic sampling for uncertainty and shown to improve performance in MPC simulations.