Olli Järviniemi and Evan Hubinger

doi: 10 · 2024 · arXiv 2024.100988

9 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

Fuzzing Large Language Models to Elicit Hidden Behaviours

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Fuzzing via Gaussian noise on weights or residual activations elicits hidden backdoor behaviors more often than temperature sampling on four of six models, with proxy-task hyperparameter selection via Thompson sampling improving results over uniform sweeps.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Task conditioning suppresses safety-critical signal reporting in language and vision models that unconstrained versions report at higher rates, creating an inattentional gap that decouples benchmark safety from real-world safety.

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

Poker Arena introduces a nine-axis cognitive profile and three-layer memory system for evaluating LLMs in no-limit Texas Hold'em, revealing that chip winnings and axis scores produce different model rankings.

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.

From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

cs.CY · 2026-04-06 · unverdicted · novelty 6.0

A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic deception coverage.

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

cs.CY · 2026-05-29 · unverdicted · novelty 3.0

Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Fuzzing Large Language Models to Elicit Hidden Behaviours · cs.LG · 2026-06-28 · unverdicted · none · ref 3
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment · cs.AI · 2026-06-09 · unverdicted · none · ref 37
The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report · cs.CL · 2026-06-25 · unverdicted · none · ref 28
Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs · cs.AI · 2026-06-11 · unverdicted · none · ref 2
To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation · cs.AI · 2026-06-06 · unverdicted · none · ref 23
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations · cs.CL · 2026-05-27 · unverdicted · none · ref 14
From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception · cs.CY · 2026-04-06 · unverdicted · none · ref 9
Position: Anthropomorphic Misalignment Research Needs Stronger Evidence · cs.CY · 2026-05-29 · unverdicted · none · ref 30
