Surround each question in blockquotes and append to the result from stage 1

Randomly sample 5 example questions from the 10 human-written questions
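The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: the function name `build_prompt`, its parameters, and the choice of Markdown-style `> ` prefixes for "blockquotes" are all assumptions, since the source does not say which blockquote syntax is used or how the stage 1 result is represented.

```python
import random


def build_prompt(stage1_text, human_questions, k=5, seed=None):
    """Append k randomly sampled questions, each as a blockquote, to stage1_text.

    Assumptions (not stated in the source): stage1_text is a plain string,
    and a "blockquote" means a Markdown-style "> " prefix on every line.
    """
    rng = random.Random(seed)  # a local RNG so a seed gives a reproducible sample
    sampled = rng.sample(human_questions, k)  # sampling without replacement

    quoted_blocks = []
    for question in sampled:
        # Prefix every line so multi-line questions stay inside the quote.
        quoted = "\n".join("> " + line for line in question.splitlines())
        quoted_blocks.append(quoted)

    # Blank lines separate the stage 1 text and each quoted question.
    return stage1_text + "\n\n" + "\n\n".join(quoted_blocks)


if __name__ == "__main__":
    # Placeholder questions; the real 10 human-written questions are not in the source.
    questions = [f"Example question {i}?" for i in range(1, 11)]
    print(build_prompt("Result from stage 1.", questions, seed=0))
```

Sampling without replacement keeps the five examples distinct, and passing a seed makes a given draw repeatable when comparing prompts.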

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

Cited as ref 16 in Discovering Language Model Behaviors with Model-Written Evaluations.