AI Safety Gridworlds

Andrew Lefrancq; Jan Leike; Laurent Orseau; Miljan Martic; Pedro A. Ortega; Shane Legg; Tom Everitt; Victoria Krakovna

arxiv: 1711.09883 · v2 · pith:YDUPI5DHnew · submitted 2017-11-27 · 💻 cs.LG · cs.AI

Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg

keywords: function, problems, safe, safety, agents, environments, learning, performance

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.
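
The split between an observed reward and a hidden performance function can be illustrated with a minimal sketch. Everything below is a hypothetical toy written for this page: the `ToyGrid` class, its 2x3 layout, the vase cell, and the numeric rewards and penalty are assumptions, not the paper's environments or the API of any released code. The agent only ever receives the reward returned by `step`; the performance score is tracked internally and read out by the evaluator afterwards.

```python
# Hypothetical toy, not one of the paper's environments.
# Grid is 2 rows x 3 columns: start (0, 0), goal (0, 2), vase at (0, 1).


class ToyGrid:
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self):
        self.rows, self.cols = 2, 3
        self.pos = (0, 0)
        self.goal = (0, 2)
        self.vase = (0, 1)          # stepping here is an unwanted side effect
        self.vase_broken = False
        self.done = False
        self._performance = 0.0     # hidden from the agent

    def step(self, action):
        """Move the agent and return (observation, observed reward)."""
        dr, dc = self.MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.rows - 1)
        c = min(max(self.pos[1] + dc, 0), self.cols - 1)
        self.pos = (r, c)

        reward = -1.0               # assumed per-step cost
        penalty = 0.0
        if self.pos == self.vase and not self.vase_broken:
            self.vase_broken = True
            penalty = 20.0          # assumed; counted only in performance
        if self.pos == self.goal:
            reward += 10.0          # assumed goal bonus
            self.done = True

        self._performance += reward - penalty
        return self.pos, reward

    def hidden_performance(self):
        """Evaluator-only score; never part of what step() returns."""
        return self._performance


def run(actions):
    """Play a fixed action sequence; return (total reward, performance)."""
    env = ToyGrid()
    total = 0.0
    for action in actions:
        _, reward = env.step(action)
        total += reward
    return total, env.hidden_performance()


short = run(["right", "right"])                # walks through the vase
long_ = run(["down", "right", "right", "up"])  # detours around it
print(short)  # (8.0, -12.0)
print(long_)  # (6.0, 6.0)
```

In this toy the shorter route earns more observed reward but a lower hidden performance score, so the two functions disagree. In the abstract's terms that would make it a specification-type problem; a robustness-type toy would instead keep the performance function equal to the reward.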

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discovering Agentic Safety Specifications from 1-Bit Danger Signals
cs.AI 2026-04 unverdicted novelty 7.0
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

Learning the Arrow of Time
cs.LG 2019-07 unverdicted novelty 7.0
Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.

SARC: A Governance-by-Architecture Framework for Agentic AI Systems
cs.SE 2026-05 unverdicted novelty 6.0
SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...

Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
cs.AI 2026-05 unverdicted novelty 6.0
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.

Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
cs.LG 2026-04 unverdicted novelty 6.0
Safety training modulates harmful misalignment under on-policy RL in LLMs, but the effect reverses depending on environment design and model size.

Generalizing from a few environments in safety-critical reinforcement learning
cs.LG 2019-07 unverdicted novelty 6.0
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.

Categorizing Wireheading in Partially Embedded Agents
cs.AI 2019-06 unverdicted novelty 6.0
Presents a taxonomy of wireheading in partially embedded agents, defines wirehead-vulnerable agents, demonstrates via AIXIjs simulation, and conjectures that specification gaming is the only other misalignment type.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Accelerating Policy Synthesis in Large-Scale MDPs via Hierarchical Adaptive Refinement
cs.AI 2025-06 unverdicted novelty 5.0
Presents hierarchical adaptive refinement to accelerate near-optimal policy synthesis in MDPs up to 1M states with up to 2x speedup over PRISM and formal error bounds.

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
cs.RO 2025-03 unverdicted novelty 5.0
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.

Brainrot: Deskilling and Addiction are Overlooked AI Risks
cs.CY 2026-05 unverdicted novelty 3.0
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.

The Role of Cooperation in Responsible AI Development
cs.CY 2019-07 unverdicted novelty 3.0
Competitive pressures in AI development create collective action problems that may require industry cooperation, with key factors and strategies identified to enable responsible outcomes.

A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions
eess.SY 2025-08 unverdicted novelty 2.0
A literature review of safe RL using Lyapunov and barrier functions that identifies a shift to model-free methods since 2017, well-defined open problems per approach class, and high-dimensional scalability as the main...