Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv preprint arXiv:2311.08379
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
Self-modification in superintelligence collapses via non-commuting operators into a structure identical to Priest's inclosure schema and Derrida's différance.
citing papers explorer
-
Frontier Models are Capable of In-context Scheming
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
-
LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
-
Scheming Ability in LLM-to-LLM Strategic Interactions
Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.
-
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
-
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
-
Deconstructing Superintelligence: Identity, Self-Modification and Diff\'erance
Self-modification in superintelligence collapses via non-commuting operators into a structure identical to Priest's inclosure schema and Derrida's différance.