Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
Surround each question in blockquotes and append to the result from stage 1
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2022 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Discovering Language Model Behaviors with Model-Written Evaluations
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.