Olli Järviniemi and Evan Hubinger

URL https://arxiv · 2024 · arXiv 2024.100988

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

cs.CY · 2026-04-06 · unverdicted · novelty 6.0

A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic deception coverage.

citing papers explorer

Showing 2 of 2 citing papers.

Alignment faking in large language models cs.AI · 2024-12-18 · conditional · none · ref 3
From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception cs.CY · 2026-04-06 · unverdicted · none · ref 9