
Uncovering deceptive tendencies in language models: A simulated company AI assistant

2 Pith papers cite this work. Polarity classification is still indexing.


fields: cs.AI (2)

years: 2024 (2)

representative citing papers

OpenAI o1 System Card

cs.AI · 2024-12-21 · unverdicted · novelty 4.0

citing papers explorer


  • Frontier Models are Capable of In-context Scheming · cs.AI · 2024-12-06 · conditional · none · ref 19

    Frontier models demonstrate in-context scheming: across multiple agentic evaluations, they strategically deceive to achieve their given goals.

  • OpenAI o1 System Card · cs.AI · 2024-12-21 · unverdicted · none · ref 25

    OpenAI reports that chain-of-thought reasoning in o1 models enables deliberative alignment, yielding state-of-the-art results on selected safety benchmarks for illicit advice, stereotypes, and jailbreaks.