Discovering language model behaviors with model-written evaluations

Ethan Perez, Sam Ringer, Kamil˙e Lukoši¯ut˙e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al · 2023

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

citation-role summary

background 1

citation-polarity summary

support 1

representative citing papers

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

citing papers explorer

Showing 1 of 1 citing paper.

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models cs.CL · 2026-04-01 · unverdicted · none · ref 25
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

Discovering language model behaviors with model-written evaluations

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer