Me, my- self, and AI: The situational awareness dataset (SAD) for LLMs.arXiv preprint arXiv:2407.04694

Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans · 2024 · arXiv 2407.04694

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

representative citing papers

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · conditional · novelty 6.0

Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

cs.LG · 2024-10-02 · unverdicted · novelty 6.0

Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.

citing papers explorer

Showing 5 of 5 citing papers.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity cs.CL · 2026-05-07 · unverdicted · none · ref 25
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Honeypot Protocol cs.CR · 2026-04-14 · unverdicted · none · ref 9
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
Frontier Models are Capable of In-context Scheming cs.AI · 2024-12-06 · conditional · none · ref 21
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs cs.CL · 2026-05-19 · conditional · none · ref 8
Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct cs.LG · 2024-10-02 · unverdicted · none · ref 10
Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.

Me, my- self, and AI: The situational awareness dataset (SAD) for LLMs.arXiv preprint arXiv:2407.04694

fields

years

verdicts

representative citing papers

citing papers explorer