Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv preprint arXiv:2311.08379

Carlsmith, Joe · 2023 · arXiv 2311.08379

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

cs.CR · 2026-04-16 · unverdicted · novelty 6.0

LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.

Scheming Ability in LLM-to-LLM Strategic Interactions

cs.CL · 2025-10-11 · conditional · novelty 6.0

Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

cs.MA · 2026-03-29 · unverdicted · novelty 5.0

Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · novelty 5.0

Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

Deconstructing Superintelligence: Identity, Self-Modification and Différance

cs.AI · 2026-04-21 · unverdicted · novelty 4.0

Self-modification in superintelligence collapses via non-commuting operators into a structure identical to Priest's inclosure schema and Derrida's différance.

citing papers explorer

Showing 6 of 6 citing papers.

Frontier Models are Capable of In-context Scheming · cs.AI · 2024-12-06 · conditional · none · ref 9
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
LinuxArena: A Control Setting for AI Agents in Live Production Software Environments · cs.CR · 2026-04-16 · unverdicted · none · ref 4
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
Scheming Ability in LLM-to-LLM Strategic Interactions · cs.CL · 2025-10-11 · conditional · none · ref 8
Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.
Emergent Social Intelligence Risks in Generative Multi-Agent Systems · cs.MA · 2026-03-29 · unverdicted · none · ref 15
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · cs.AI · 2025-07-15 · unverdicted · none · ref 5
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
Deconstructing Superintelligence: Identity, Self-Modification and Différance · cs.AI · 2026-04-21 · unverdicted · none · ref 7
Self-modification in superintelligence collapses via non-commuting operators into a structure identical to Priest's inclosure schema and Derrida's différance.
