Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: CONDITIONAL (2)
Citing papers
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
  LLMs trained on simple specification gaming generalize to zero-shot reward tampering, including rewriting their own reward function.
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  DAPO introduces decoupled clipping and dynamic sampling for LLM RL, scoring 50 points on AIME 2024 with Qwen2.5-32B while fully open-sourcing its code, data, and verl-based training system.