Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

2024 · cs.AI · arXiv 2409.07985

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce reductions from AI-Control Games to a special case of zero-sum partially observable stochastic games that allow us to leverage existing algorithms to find Pareto-optimal protocols. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. To demonstrate the utility of our formalism, we show improvements over empirical studies in existing settings, evaluate protocols in new settings, and analyse how modelling assumptions affect the safety and usefulness of protocols. Finally, we leverage our formalism to precisely describe some of the implicit assumptions in prior control work.
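
The idea of trading off safety against usefulness while the adversary best-responds can be illustrated with a toy sketch. The abstract does not spell out the paper's reduction, so this uses plain weighted-sum scalarisation as a stand-in: for each weight, pick the protocol that maximises a weighted sum of worst-case safety and usefulness. The protocol names, attack names, and all numbers are hypothetical, and only pure strategies over a finite table are considered, not the partially observable stochastic game itself.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data (not taken from the paper).
# SAFETY[protocol][attack]: probability of no catastrophe under that attack.
SAFETY: Dict[str, Dict[str, float]] = {
    "no_monitoring": {"always_attack": 0.0, "rare_attack": 0.6},
    "trusted_monitoring": {"always_attack": 0.9, "rare_attack": 0.7},
    "trusted_model_only": {"always_attack": 1.0, "rare_attack": 1.0},
}
# USEFULNESS[protocol]: task performance when the untrusted model is benign.
USEFULNESS: Dict[str, float] = {
    "no_monitoring": 1.0,
    "trusted_monitoring": 0.95,
    "trusted_model_only": 0.4,
}


def worst_case_safety(protocol: str) -> float:
    """Safety against the adversary's best response (lowest safety value)."""
    return min(SAFETY[protocol].values())


def best_protocol(weight: float) -> str:
    """Protocol maximising weight * safety + (1 - weight) * usefulness."""
    def score(protocol: str) -> float:
        return (weight * worst_case_safety(protocol)
                + (1.0 - weight) * USEFULNESS[protocol])
    return max(SAFETY, key=score)


def sweep(weights: List[float]) -> List[Tuple[float, str]]:
    """Trace out scalarised optima across a list of safety weights."""
    return [(w, best_protocol(w)) for w in weights]


if __name__ == "__main__":
    for w, p in sweep([0.0, 0.5, 1.0]):
        print(f"weight={w:.1f} -> {p}")
```

Sweeping the weight from 0 to 1 moves the chosen protocol from the most useful option to the safest one. A weighted-sum sweep over pure strategies can miss Pareto-optimal points that only mixed protocols achieve, which is one reason a full game-theoretic treatment is needed.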

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

Strategic attack selection via start and stop policies reduces empirical safety by 20-28pp in BashArena and LinuxArena agentic control evaluations without changing attack capability.

AI Alignment via Incentives and Correction

cs.LG · 2026-05-02 · unverdicted · novelty 6.0 · 2 refs

AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.

Scheming Ability in LLM-to-LLM Strategic Interactions

cs.CL · 2025-10-11 · conditional · novelty 6.0

Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · novelty 5.0

Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

citing papers explorer

Showing 4 of 4 citing papers.

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
cs.AI · 2026-06-03 · unverdicted · none · ref 3 · internal anchor

AI Alignment via Incentives and Correction
cs.LG · 2026-05-02 · unverdicted · none · ref 24 · 2 links · internal anchor

Scheming Ability in LLM-to-LLM Strategic Interactions
cs.CL · 2025-10-11 · conditional · none · ref 24 · internal anchor

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
cs.AI · 2025-07-15 · unverdicted · none · ref 46 · internal anchor
