Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures

Jonathan Uesato, Ananya Kumar, Csaba Szepesvari, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy (Dj) Dvijotham, Nicolas Heess, Pushmeet Kohli


classification cs.LG cs.CR stat.ML

keywords evaluation, failures, agents, approach, adversarial, agent, catastrophic, failure


Abstract

This paper addresses the problem of evaluating learning systems in safety-critical domains such as autonomous driving, where failures can have catastrophic consequences. We focus on two problems: searching for scenarios in which learned agents fail, and assessing their probability of failure. The standard method for agent evaluation in reinforcement learning, Vanilla Monte Carlo, can miss failures entirely, leading to the deployment of unsafe agents. We demonstrate that this is an issue for current agents, where even matching the compute used for training is sometimes insufficient for evaluation. To address this shortcoming, we draw upon the rare event probability estimation literature and propose an adversarial evaluation approach. Our approach focuses evaluation on adversarially chosen situations, while still providing unbiased estimates of failure probabilities. The key difficulty is in identifying these adversarial situations: since failures are rare, there is little signal to drive optimization. To solve this, we propose a continuation approach that learns failure modes in related but less robust agents. Our approach also allows reuse of data already collected for training the agent. We demonstrate the efficacy of adversarial evaluation on two standard domains: humanoid control and simulated driving. Experimental results show that our methods can find catastrophic failures and estimate failure rates of agents multiple orders of magnitude faster than standard evaluation schemes, in minutes to hours rather than days.
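The contrast the abstract draws between Vanilla Monte Carlo and evaluation focused on adversarially chosen situations can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it is generic importance sampling, a standard technique from the rare event estimation literature, and every name in it (`THRESHOLD`, `fails`, `vanilla_mc`, `importance_sampling`) is hypothetical. It assumes a one-dimensional initial condition drawn from a standard normal and a toy agent that fails whenever that value exceeds a threshold, so the true failure probability is known in closed form.

```python
import math
import random

# Hypothetical toy setup: the "agent" fails when the sampled
# initial condition exceeds THRESHOLD. Not taken from the paper.
THRESHOLD = 4.0


def fails(x):
    """Return True if the toy agent fails on initial condition x."""
    return x > THRESHOLD


def vanilla_mc(n, rng):
    """Estimate the failure probability by sampling x ~ N(0, 1) directly."""
    return sum(fails(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


def importance_sampling(n, rng, shift=THRESHOLD):
    """Estimate the failure probability by sampling x ~ N(shift, 1),
    which concentrates samples near the failure region, and reweighting
    each failure by the density ratio N(0, 1) / N(shift, 1).
    The reweighting keeps the estimate unbiased."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if fails(x):
            # Density ratio simplifies to exp(-shift * x + shift^2 / 2).
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n


if __name__ == "__main__":
    n = 20000
    true_p = 0.5 * math.erfc(THRESHOLD / math.sqrt(2.0))
    print("true failure probability:", true_p)
    print("vanilla Monte Carlo     :", vanilla_mc(n, random.Random(0)))
    print("importance sampling     :", importance_sampling(n, random.Random(0)))
```

In this toy setting the failure probability is on the order of 1e-5, so a Vanilla Monte Carlo run of this size expects fewer than one failure and frequently reports zero, while the reweighted estimator sees failures in roughly half of its samples. In the paper's setting the hard part is learning where to focus the samples; here the proposal distribution is simply assumed.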



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
cs.LG 2026-04 conditional novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
cs.LG 2026-04 unverdicted novelty 6.0

PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.