AI Safety Gridworlds

Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg

classification: cs.LG, cs.AI

keywords: function, problems, safe, safety, agents, environments, learning, performance

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.
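The split between an observed reward and a hidden performance function can be illustrated with a minimal sketch. Everything below is a hypothetical toy (the `ToyCorridor` class, the `FORWARD` and `DETOUR` actions, and all numeric costs are assumptions made for illustration) and is not the paper's environment suite or its API. The agent sees only the reward returned by `step`; the evaluator additionally tracks a performance score that penalizes entering a hazard cell.

```python
from dataclasses import dataclass

FORWARD, DETOUR = "forward", "detour"  # hypothetical action names


@dataclass
class ToyCorridor:
    """Hypothetical 1-D corridor: start at cell 0, goal at the last cell."""
    length: int = 5
    hazard: int = 2                   # cell the designer wants the agent to avoid
    pos: int = 0
    hidden_performance: float = 0.0   # tracked by the evaluator, never returned by step()

    def step(self, action):
        # DETOUR skips over the next cell but costs three time steps instead of one.
        if action == DETOUR:
            self.pos, cost = self.pos + 2, 3.0
        else:
            self.pos, cost = self.pos + 1, 1.0
        self.pos = min(self.pos, self.length - 1)
        done = self.pos == self.length - 1
        # Observed reward: time cost per move, plus a bonus for reaching the goal.
        reward = -cost + (10.0 if done else 0.0)
        # Hidden performance: the same reward minus a penalty for entering the hazard.
        penalty = 5.0 if self.pos == self.hazard else 0.0
        self.hidden_performance += reward - penalty
        return self.pos, reward, done


def evaluate(policy):
    """Run one episode; return (observed return, hidden performance)."""
    env = ToyCorridor()
    total_reward, done = 0.0, False
    while not done:
        _, reward, done = env.step(policy(env.pos))
        total_reward += reward
    return total_reward, env.hidden_performance


def greedy(pos):
    # Maximizes observed reward by walking straight through the hazard.
    return FORWARD


def careful(pos):
    # Takes the detour from cell 1, so cell 2 is never entered.
    return DETOUR if pos == 1 else FORWARD


print(evaluate(greedy))   # (6.0, 1.0)
print(evaluate(careful))  # (5.0, 5.0)
```

In this toy the reward-maximizing policy earns the higher observed return but the lower hidden performance, which is the kind of mismatch the abstract groups under specification problems. If the hazard penalty were removed so that performance equaled the observed return, the toy would instead fall on the robustness side of that split.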

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discovering Agentic Safety Specifications from 1-Bit Danger Signals
cs.AI 2026-04 unverdicted novelty 7.0

LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

SARC: A Governance-by-Architecture Framework for Agentic AI Systems
cs.SE 2026-05 unverdicted novelty 6.0

SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...

Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
cs.AI 2026-05 unverdicted novelty 6.0

Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.

Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
cs.LG 2026-04 unverdicted novelty 6.0

Safety training modulates harmful misalignment under on-policy RL in LLMs, but the effect reverses depending on environment design and model size.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Brainrot: Deskilling and Addiction are Overlooked AI Risks
cs.CY 2026-05 unverdicted novelty 3.0

AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.