
arxiv: 2605.25114 · v1 · pith:ETLA5BZ5new · submitted 2026-05-24 · 📊 stat.ML · cs.LG

Counterfactually Safe Reinforcement Learning

classification 📊 stat.ML cs.LG
keywords harm · learning · expected · individual · maximize · policy · reinforcement · return

Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.
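To make the counterfactual notion of harm concrete, here is a minimal sketch on toy data. It is not the paper's estimator or its two-stage procedure: the data, the names `potential_outcomes`, `harm_rate`, and `mean_return`, and the choice of action 0 as the baseline are all assumptions made for illustration. It also assumes both potential outcomes are known for every individual, which is only possible in simulation.

```python
# Hypothetical toy data: each tuple holds BOTH potential outcomes for one
# individual: (outcome under baseline action 0, outcome under action 1).
# In real data only one of the two would ever be observed.
potential_outcomes = [
    (1.0, 3.0),
    (2.0, 4.0),
    (2.0, 1.0),  # this individual does strictly worse under action 1
    (0.0, 2.0),
]

BASELINE = 0  # assumed baseline action


def harm_rate(outcomes, policy_actions, baseline=BASELINE):
    """Fraction of individuals whose outcome under the chosen action is
    strictly worse than their outcome under the baseline action."""
    harmed = [y[a] < y[baseline] for y, a in zip(outcomes, policy_actions)]
    return sum(harmed) / len(harmed)


def mean_return(outcomes, policy_actions):
    """Average outcome across individuals under the chosen actions."""
    total = sum(y[a] for y, a in zip(outcomes, policy_actions))
    return total / len(outcomes)


always_treat = [1, 1, 1, 1]      # best on average in this toy example
always_baseline = [0, 0, 0, 0]   # never harms by construction

print(mean_return(potential_outcomes, always_treat),
      harm_rate(potential_outcomes, always_treat))
print(mean_return(potential_outcomes, always_baseline),
      harm_rate(potential_outcomes, always_baseline))
```

In this toy example, always choosing action 1 gives the higher average outcome but harms one of the four individuals, while always choosing the baseline harms no one at a lower average. That is the tension between expected return and individual harm that the abstract describes.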

