Refining Minimax Regret for Unsupervised Environment Design

Jakob Foerster; Mattie Fellows; Michael Beukman; Michael Dennis; Michael Matthews; Minqi Jiang; Samuel Coward

arxiv: 2402.12284 · v2 · pith:SUVGANQJnew · submitted 2024-02-19 · 💻 cs.LG · cs.AI


Michael Beukman, Samuel Coward, Michael Matthews, Mattie Fellows, Minqi Jiang, Michael Dennis, Jakob Foerster


keywords: regret, levels, learning, minimax, objective, adversary, environment, policy



In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.
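
The stagnation the abstract describes can be seen in a toy calculation. The sketch below is not ReMiDi and does not implement the paper's formal BLP definition. It assumes a made-up table of returns for three candidate policies on three levels (`RETURNS`), measures regret against the best candidate on each level, and uses a simplified stand-in refinement (`refine`, a hypothetical helper) that re-solves minimax regret on the levels where the bound is not attained. All names and numbers are illustrative assumptions.

```python
# Hypothetical RETURNS[policy][level]; the numbers are invented for illustration.
RETURNS = [
    [1.0, 0.5, 0.5],    # policy 0
    [1.0, 0.5, 0.75],   # policy 1
    [0.25, 1.0, 0.75],  # policy 2
]


def regret_table(returns):
    """Regret = best return any candidate achieves on a level, minus own return."""
    n_levels = len(returns[0])
    best = [max(row[lvl] for row in returns) for lvl in range(n_levels)]
    return [[best[lvl] - row[lvl] for lvl in range(n_levels)] for row in returns]


def minimax_regret(regret, tol=1e-9):
    """Return (policies with the smallest worst-case regret, that regret bound)."""
    worst = [max(row) for row in regret]
    bound = min(worst)
    return [p for p, w in enumerate(worst) if w <= bound + tol], bound


def refine(regret, candidates, bound, tol=1e-9):
    """Toy refinement (an assumption, not the paper's BLP): among the MMR
    candidates, minimise worst-case regret on the levels where the bound
    is not attained by every candidate."""
    n_levels = len(regret[0])
    open_levels = [
        lvl for lvl in range(n_levels)
        if not all(regret[p][lvl] >= bound - tol for p in candidates)
    ]
    if not open_levels:
        return candidates  # nothing left to improve on
    residual = {p: max(regret[p][lvl] for lvl in open_levels) for p in candidates}
    best = min(residual.values())
    return [p for p in candidates if residual[p] <= best + tol]


REGRET = regret_table(RETURNS)
MMR_POLICIES, BOUND = minimax_regret(REGRET)
REFINED = refine(REGRET, MMR_POLICIES, BOUND)
print(MMR_POLICIES, BOUND, REFINED)  # prints: [0, 1] 0.5 [1]
```

In this toy table, policies 0 and 1 both attain the minimax regret bound of 0.5 on level 1, so an adversary that only samples regret-maximising levels cannot tell them apart, even though policy 1 has lower regret on level 2. The refinement step picks policy 1, which is still a minimax regret policy.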



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PACE: Parameter Change for Unsupervised Environment Design
cs.LG 2026-05 unverdicted novelty 7.0

PACE uses the squared L2 norm of policy parameter changes from a first-order approximation as an efficient proxy for environment value in UED, outperforming baselines with higher IQM and lower optimality gap on MiniGr...