International Symposium on Research in Attacks, Intrusions, and Defenses
2 Pith papers cite this work.
representative citing papers
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
- TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting