LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
CoRR, abs/1909.06044
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
CONDITIONAL 2representative citing papers
One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
citing papers explorer
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
Red Teaming Language Models with Language Models
One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.