LLMs trained on simple specification gaming generalize to zero-shot reward tampering, including rewriting their own reward function.
International Symposium on Research in Attacks, Intrusions, and Defenses
2 Pith papers cite this work. Polarity classification is still indexing.