Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective

Tom Everitt, Marcus Hutter, Ramana Kumar, Victoria Krakovna · 2021

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

cs.LG · 2025-03-18 · conditional · novelty 6.0

DAPO introduces decoupled clipping and dynamic sampling for LLM RL, achieving 50 on AIME 2024 with Qwen2.5-32B while fully open-sourcing code, data, and the verl-based training system.

citing papers explorer

Showing 2 of 2 citing papers.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · none · ref 11

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

cs.LG · 2025-03-18 · conditional · none · ref 27

DAPO introduces decoupled clipping and dynamic sampling for LLM RL, achieving 50 on AIME 2024 with Qwen2.5-32B while fully open-sourcing code, data, and the verl-based training system.
