Social Biases in NLP Models as Barriers for Persons with Disabilities
3 Pith papers cite this work.
representative citing papers
LLMs outperform humans in expressing illocutionary intents and sycophancy in successful persuasive counter-arguments from ChangeMyView, with crowd workers preferring LLM versions.
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
citing papers explorer
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.