AI Safety Gridworlds
read the original abstract
We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.
This paper has not been read by Pith yet.
Forward citations
Cited by 13 Pith papers
-
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
-
Learning the Arrow of Time
Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.
-
SARC: A Governance-by-Architecture Framework for Agentic AI Systems
SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...
-
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
-
Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
Safety training modulates harmful misalignment under on-policy RL in LLMs, but the effect reverses depending on environment design and model size.
-
Generalizing from a few environments in safety-critical reinforcement learning
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
-
Categorizing Wireheading in Partially Embedded Agents
Presents a taxonomy of wireheading in partially embedded agents, defines wirehead-vulnerable agents, demonstrates via AIXIjs simulation, and conjectures that specification gaming is the only other misalignment type.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
-
Accelerating Policy Synthesis in Large-Scale MDPs via Hierarchical Adaptive Refinement
Presents hierarchical adaptive refinement to accelerate near-optimal policy synthesis in MDPs up to 1M states with up to 2x speedup over PRISM and formal error bounds.
-
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
-
Brainrot: Deskilling and Addiction are Overlooked AI Risks
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
-
The Role of Cooperation in Responsible AI Development
Competitive pressures in AI development create collective action problems that may require industry cooperation, with key factors and strategies identified to enable responsible outcomes.
-
A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions
A literature review of safe RL using Lyapunov and barrier functions that identifies a shift to model-free methods since 2017, well-defined open problems per approach class, and high-dimensional scalability as the main...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.