Olli Järviniemi and Evan Hubinger

Peter S · 2024 · arXiv 2024.100988

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Task conditioning suppresses safety-critical signal reporting in language and vision models that unconstrained versions report at higher rates, creating an inattentional gap that decouples benchmark safety from real-world safety.

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

Poker Arena introduces a nine-axis cognitive profile and three-layer memory system for evaluating LLMs in no-limit Texas Hold'em, revealing that chip winnings and axis scores produce different model rankings.

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.

From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

cs.CY · 2026-04-06 · unverdicted · novelty 6.0

A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic deception coverage.

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

cs.CY · 2026-05-29 · unverdicted · novelty 3.0

Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.
