Reinforcement Learning with a Corrupted Reward Channel

Laurent Orseau; Marcus Hutter; Shane Legg; Tom Everitt; Victoria Krakovna

arxiv: 1705.08417 · v2 · pith:V2BQAEC5new · submitted 2017-05-23 · 💻 cs.AI · cs.LG · stat.ML

Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg

classification 💻 cs.AI cs.LG stat.ML

keywords reward learning reinforcement agent problem sensory assumptions corrupt


No real-world reward function is perfect. Sensory errors and software bugs may result in RL agents observing higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP. Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
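The two failure and mitigation ideas in the abstract can be illustrated with a toy sketch. It is not the paper's formal CRMDP construction or its algorithm: the state set, the reward values, the single corrupt state, and the function names (`observed_reward`, `greedy_choice`, `randomised_choice`, `expected_true_reward_randomised`) are all assumptions made for illustration. It only shows why an agent that maximises the observed reward can land on a corrupt state, and how choosing at random among sufficiently good-looking states can limit the damage.

```python
import random

# Hypothetical toy setup (not from the paper): ten states with true rewards
# in [0, 1]. State 9 is corrupt: the reward channel reports the maximum
# reward for it, although its true reward is small.
TRUE_REWARD = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.05]
CORRUPT_STATES = {9}
STATES = range(len(TRUE_REWARD))


def observed_reward(state):
    """Reward as seen through the possibly corrupted channel."""
    return 1.0 if state in CORRUPT_STATES else TRUE_REWARD[state]


def greedy_choice():
    """Pick the state with the highest observed reward."""
    return max(STATES, key=observed_reward)


def candidates_above(threshold):
    """States whose observed reward is at least the threshold."""
    return [s for s in STATES if observed_reward(s) >= threshold]


def randomised_choice(threshold, rng):
    """Pick uniformly at random among states that look good enough."""
    return rng.choice(candidates_above(threshold))


def expected_true_reward_randomised(threshold):
    """Expected true reward of the uniform random choice above."""
    candidates = candidates_above(threshold)
    return sum(TRUE_REWARD[s] for s in candidates) / len(candidates)


if __name__ == "__main__":
    g = greedy_choice()
    print("greedy picks state", g, "with true reward", TRUE_REWARD[g])
    print("randomised expected true reward:",
          expected_true_reward_randomised(0.6))
    print("one randomised draw:", randomised_choice(0.6, random.Random(0)))
```

In this toy setup the greedy agent selects the corrupt state, whose true reward is 0.05, while the randomised agent with threshold 0.6 averages a true reward of about 0.61, because the corrupt state is only one of five candidates. The benefit depends on corrupt states being rare among the candidates, which mirrors the abstract's caveat that randomisation helps only partially and under some assumptions.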


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Let's Verify Step by Step
cs.LG 2023-05 accept novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

Solving math word problems with process- and outcome-based feedback
cs.LG 2022-11 unverdicted novelty 6.0

On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.

Scaling Laws for Reward Model Overoptimization
cs.LG 2022-10 unverdicted novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model pa...