Uncovering deceptive tendencies in language models: A simulated company AI assistant

2024 · arXiv 2405.01576

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming: across multiple agentic evaluations, they strategically deceive to achieve the goals they are given.

OpenAI o1 System Card

cs.AI · 2024-12-21 · unverdicted · novelty 4.0

OpenAI reports that chain-of-thought reasoning in o1 models enables deliberative alignment, yielding state-of-the-art results on selected safety benchmarks for illicit advice, stereotypes, and jailbreaks.

citing papers explorer

Showing 2 of 2 citing papers.

Frontier Models are Capable of In-context Scheming · cs.AI · 2024-12-06 · conditional · none · ref 19
OpenAI o1 System Card · cs.AI · 2024-12-21 · unverdicted · none · ref 25
