pith. machine review for the scientific record.

arxiv: 2604.14243 · v2 · submitted 2026-04-15 · 💻 cs.LG · cs.AI

Recognition: unknown

Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords constrained reinforcement learning · robust RL · adversarial dynamics · regret bounds · safety constraints · model-based RL · uncertainty decomposition

The pith

An algorithm learns policies that remain optimal and safe when state transitions depend on both the agent's actions and an explicit adversarial policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world systems often face state changes driven by external factors, such as competitors or disturbances, that standard constrained reinforcement learning simply ignores. This paper models those factors explicitly as an adversarial policy that co-determines the next state together with the agent's action. It introduces RHC-UCRL, a model-based method that builds optimistic estimates over both the agent's policy and the adversary's policy while separating model uncertainty (epistemic) from random noise (aleatoric). The paper proves that cumulative regret and cumulative constraint violations grow sublinearly in time. If the guarantees hold, agents could learn reliable behaviors in unpredictable settings without the usual risk of sudden safety failures.
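
To make the setting concrete, here is a minimal sketch (not from the paper) of the transition structure being studied, s_{h+1} = f(s_h, a_h, ā_h) + ω_h, using an illustrative linear f and Gaussian noise; the matrices and both policies below are placeholders, not the paper's model.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])       # state drift (hypothetical)
    B = np.array([0.0, 0.1])         # agent's control channel (hypothetical)
    C = np.array([0.0, 0.05])        # adversary's exogenous channel (hypothetical)

    def step(s, a, a_bar, noise_scale=0.01):
        # Agent action a and adversary action a_bar jointly determine the
        # next state; omega is the additive aleatoric noise.
        omega = noise_scale * rng.standard_normal(s.shape)
        return A @ s + a * B + a_bar * C + omega

    s = np.array([1.0, 0.0])
    for h in range(5):
        a = -0.5 * s[1]              # placeholder agent policy
        a_bar = 0.3 * np.sign(s[0])  # crude destabilizing exogenous action
        s = step(s, a, a_bar)
        print(h, s)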

Core claim

The paper shows that safety-constrained RL under explicit adversarial dynamics admits an algorithm, Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), that maintains optimism over both agent and adversary policies, separates epistemic from aleatoric uncertainty, and delivers sub-linear regret together with sub-linear constraint violation bounds.

What carries the argument

RHC-UCRL, which maintains optimism over both agent and adversary policies by constructing hallucinated transition models that account for the worst-case exogenous action.
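
As a rough illustration of the hallucination idea (in the spirit of H-UCRL-style methods; the paper's exact construction is not reproduced here), epistemic uncertainty in the learned model can be reparameterized as an auxiliary control η with entries in [-1, 1], so planned transitions range over the confidence set μ ± β·σ. The μ, σ, and β values below are stand-ins.

    import numpy as np

    def hallucinated_transition(mu, sigma, eta, beta=2.0):
        # mu, sigma: posterior mean / epistemic std of f(s, a, a_bar)
        # eta:       auxiliary "hallucination" control, entries in [-1, 1]
        # beta:      confidence-set width (calibration parameter)
        eta = np.clip(eta, -1.0, 1.0)
        return mu + beta * eta * sigma

    mu = np.array([0.3, -0.1])       # hypothetical model mean
    sigma = np.array([0.05, 0.02])   # hypothetical epistemic std

    best_case = hallucinated_transition(mu, sigma, eta=np.ones(2))
    worst_case = hallucinated_transition(mu, sigma, eta=-np.ones(2))
    print(best_case, worst_case)

Optimism over the agent picks η favorably for reward; the worst-case exogenous action is then found by searching the same set adversarially.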

Load-bearing premise

Exogenous factors can be represented as an adversarial policy that the algorithm can optimize against optimistically without needing strong assumptions on how far that policy lies from a known nominal model.

What would settle it

Run RHC-UCRL in a controlled environment where an adversary is explicitly programmed to maximize constraint violations and measure whether cumulative regret and violations stay bounded by a sublinear function of the number of steps.
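
A hedged sketch of that test: log per-step regret and violation from rollouts against the programmed adversary, then check that the cumulative curves grow with a log-log slope below 1. The synthetic per-step data below merely stands in for real measurements.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 10_000
    t = np.arange(1, T + 1)

    # Stand-in per-step violations whose cumulative sum grows ~ sqrt(T),
    # i.e. the sublinear behavior the theorem predicts.
    per_step_violation = rng.random(T) / np.sqrt(t)
    cum_violation = np.cumsum(per_step_violation)

    # Estimate the growth exponent from the tail of the log-log curve.
    slope, _ = np.polyfit(np.log(t[100:]), np.log(cum_violation[100:]), 1)
    print(f"log-log growth exponent ~ {slope:.2f} (sublinear if < 1)")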

Figures

Figures reproduced from arXiv: 2604.14243 by Arnob Ghosh, Kartik Pandit, Sourav Ganguly.

Figure 1
Figure 1. Performance of RHC-UCRL and RH-UCRL on the Cartpole-v1 environment. view at source ↗
Figure 2
Figure 2. Performance of RHC-UCRL and RH-UCRL on the Pendulum-v1 environment. view at source ↗
read the original abstract

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on \textbf{exogenous factors outside its control}--competing agents, environmental disturbances, or strategic adversaries--formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h)+\omega_h$ where $\bar{a}_h$ is the adversary/external action, $a_h$ is the agent's action, and $\omega_h$ is an additive noise. Ignoring such factors can yield policies that are optimal in isolation but \textbf{fail catastrophically in deployment}, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the \textbf{strategic interaction} between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an \textbf{adversarial policy} $\bar{\pi}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. \emph{To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics}. We propose \textbf{Robust Hallucinated Constrained Upper-Confidence RL} (\texttt{RHC-UCRL}), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. \texttt{RHC-UCRL} achieves sub-linear regret and constraint violation guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes RHC-UCRL, a model-based algorithm for safety-constrained RL in MDPs where transitions are co-determined by the agent's policy and an explicit adversarial policy representing exogenous factors (s_{h+1} = f(s_h, a_h, bar{a}_h) + omega_h). It maintains optimism over both agent and adversary policies while separating epistemic from aleatoric uncertainty, claims sub-linear regret and constraint violation guarantees, and positions itself as the first work to study this setting without nominal-model divergence assumptions.

Significance. If the dual-optimism construction and uncertainty separation rigorously deliver the stated bounds, the work would meaningfully extend robust RL to explicitly strategic adversaries in safety-critical domains. The explicit adversarial-policy modeling and lack of strong divergence assumptions are potential strengths relative to distributional-robustness baselines.

major comments (1)
  1. [Abstract] Abstract (central claim): the construction maintains optimism over both agent and adversary policies yet asserts constraint-violation guarantees against a strategic bar{pi}. For the bound to hold under s_{h+1} = f(s_h, a_h, bar{a}_h) + omega_h, optimism on bar{pi} must still induce a sufficiently pessimistic estimate of constraint satisfaction; otherwise residual model error can be exploited by the adversary. No derivation or uncertainty-set construction is shown that resolves this tension while preserving sub-linear violation.
minor comments (1)
  1. [Abstract] The abstract states 'sub-linear regret and constraint violation guarantees' without indicating the dependence on horizon H, state-action space size, or confidence parameters; explicit rates would clarify the result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for pinpointing a key tension in how our dual-optimism construction interacts with constraint-violation bounds. We address the concern directly below and indicate where revisions will be made for clarity.

read point-by-point responses
  1. Referee: [Abstract] Abstract (central claim): the construction maintains optimism over both agent and adversary policies yet asserts constraint-violation guarantees against a strategic bar{pi}. For the bound to hold under s_{h+1} = f(s_h, a_h, bar{a}_h) + omega_h, optimism on bar{pi} must still induce a sufficiently pessimistic estimate of constraint satisfaction; otherwise residual model error can be exploited by the adversary. No derivation or uncertainty-set construction is shown that resolves this tension while preserving sub-linear violation.

    Authors: We appreciate the referee highlighting this subtlety. In RHC-UCRL the uncertainty sets are constructed separately for the agent and adversary components of the transition function f. Epistemic uncertainty is isolated via Hoeffding-style concentration on the empirical estimates of f, while the aleatoric noise omega_h is handled by explicit variance terms. Optimism for the agent selects policies that maximize an upper confidence bound on reward minus a penalty for constraint violation. For the adversary, the same uncertainty set is used, but the constraint value function takes the minimum (worst-case) realization over admissible bar{a} within the set; this induces the required pessimism for constraint satisfaction even though the adversary policy itself is chosen optimistically for exploration. The resulting high-probability bound on cumulative violation is sub-linear (O(sqrt(T log(1/delta)))) and is derived in Section 4.2 (uncertainty-set construction) and Theorem 3 (regret and violation analysis), with the full proof in Appendix C. We agree the abstract statement is terse on this mechanism and will revise it to explicitly note that adversary optimism is taken inside a pessimistic constraint evaluation. revision: partial
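
A toy sketch of the mechanism this response describes, under the simplest possible reading: the reward objective is evaluated optimistically over the confidence set, while the constraint is evaluated at the worst admissible adversary action in that same set. The grids, reward/cost functions, and budget below are illustrative, not the paper's construction.

    import numpy as np

    def evaluate_candidate(etas, adversary_actions, reward_fn, cost_fn, budget):
        # Optimism for exploration: best reward realization in the set.
        optimistic_reward = max(reward_fn(eta) for eta in etas)
        # Pessimism for safety: worst constraint cost over admissible a_bar.
        worst_case_cost = max(cost_fn(a_bar) for a_bar in adversary_actions)
        return optimistic_reward, worst_case_cost, worst_case_cost <= budget

    etas = np.linspace(-1.0, 1.0, 21)       # hallucination controls
    a_bars = np.linspace(-0.5, 0.5, 11)     # admissible adversary actions

    r, c, ok = evaluate_candidate(
        etas, a_bars,
        reward_fn=lambda e: 1.0 - 0.3 * e**2,   # toy reward surrogate
        cost_fn=lambda a: 0.2 + 0.5 * abs(a),   # toy constraint cost
        budget=0.5,
    )
    print(f"optimistic reward {r:.2f}, worst-case cost {c:.2f}, feasible={ok}")

A candidate policy is accepted only if its worst-case cost stays within the budget, which is the "pessimism inside optimism" the rebuttal points to.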

Circularity Check

0 steps flagged

No circularity detected; the claims rest on the proposed algorithm without self-referential reductions in the available text.

full rationale

The abstract and problem setup introduce RHC-UCRL as a novel model-based algorithm maintaining optimism over both agent and adversary policies, with sub-linear regret and violation guarantees. No equations, derivations, or fitted parameters are presented that reduce predictions to inputs by construction. The 'first work' claim is a novelty assertion independent of self-citations. No load-bearing steps (self-definitional, fitted-input, or self-citation chains) are identifiable from the provided text, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the approach likely inherits standard MDP assumptions and optimism-based RL techniques.

axioms (1)
  • domain assumption: Exogenous factors can be modeled as an adversarial policy co-determining transitions
    Central modeling choice stated in the abstract but not justified or derived here
invented entities (1)
  • RHC-UCRL algorithm (no independent evidence)
    purpose: To achieve regret and violation guarantees under adversarial dynamics
    New method proposed; no independent evidence outside the paper

pith-pipeline@v0.9.0 · 5619 in / 1212 out tokens · 37188 ms · 2026-05-10T13:32:44.084613+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Deep reinforcement learning for robotics: A survey of real-world successes,

    C. Tang, B. Abbatematteo, J. Hu, R. Chandra, R. Martín-Martín, and P. Stone, “Deep reinforcement learning for robotics: A survey of real-world successes,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 8, no. 1, pp. 153–188, 2025

  2. [2]

    Combined deep and reinforcement learning with gaming to promote healthcare in neurodevelopmental disorders: A new hypothesis,

    F. Stasolla, A. Passaro, E. Curcio, M. Di Gioia, A. Zullo, M. Dragone, and E. Martini, “Combined deep and reinforcement learning with gaming to promote healthcare in neurodevelopmental disorders: A new hypothesis,” Frontiers in Human Neuroscience, vol. 19, p. 1557826, 2025

  3. [3]

    Certifiable safe rlhf: Fixed-penalty constraint optimization for safer language models,

    K. Pandit, S. Ganguly, A. Banerjee, S. Angizi, and A. Ghosh, “Certifiable safe rlhf: Fixed-penalty constraint optimization for safer language models,” arXiv preprint arXiv:2510.03520, 2025

  4. [4]

    Improving adaptive gameplay in serious games through interactive deep reinforcement learning,

    A. Dobrovsky, U. M. Borghoff, and M. Hofmann, “Improving adaptive gameplay in serious games through interactive deep reinforcement learning,” in Cognitive infocommunications, theory and applications. Springer, 2018, pp. 411–432

  5. [5]

    Data efficient safe reinforcement learning,

    S. Padakandla, K. Prabuchandran, S. Ganguly, and S. Bhatnagar, “Data efficient safe reinforcement learning,” in 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022, pp. 1167–1172

  6. [6]

    Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

    T. Kitamura, T. Kozuno, W. Kumagai, K. Hoshino, Y. Hosoe, K. Kasaura, M. Hamaya, P. Parmas, and Y. Matsuo, “Near-optimal policy identification in robust constrained markov decision processes via epigraph form,” arXiv preprint arXiv:2408.16286, 2024

  7. [7]

    Efficient policy optimization in robust constrained mdps with iteration complexity guarantees,

    S. Ganguly, A. Ghosh, K. Panaganti, and A. Wierman, “Efficient policy optimization in robust constrained mdps with iteration complexity guarantees,” arXiv preprint arXiv:2505.19238, 2025

  8. [8]

    Multi-agent reinforcement learning: A selective overview of theories and algorithms,

    K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of reinforcement learning and control, pp. 321–384, 2021

  9. [9]

    Combining pessimism with optimism for robust and efficient model-based deep reinforcement learning,

    S. Curi, I. Bogunovic, and A. Krause, “Combining pessimism with optimism for robust and efficient model-based deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 2254–2264

  10. [10]

    Efficient model-based reinforcement learning through optimistic policy search and planning,

    S. Curi, F. Berkenkamp, and A. Krause, “Efficient model-based reinforcement learning through optimistic policy search and planning,” Advances in Neural Information Processing Systems, vol. 33, pp. 14156–14170, 2020

  11. [11]

    Provably efficient model-free constrained rl with linear function approximation,

    A. Ghosh, X. Zhou, and N. Shroff, “Provably efficient model-free constrained rl with linear function approximation,” Advances in Neural Information Processing Systems, vol. 35, pp. 13303–13315, 2022

  12. [12]

    Last-iterate convergent policy gradient primal-dual methods for constrained mdps,

    D. Ding, C.-Y. Wei, K. Zhang, and A. Ribeiro, “Last-iterate convergent policy gradient primal-dual methods for constrained mdps,” arXiv preprint arXiv:2306.11700, 2023

  13. [13]

    Towards achieving sub-linear regret and hard constraint violation in model-free rl,

    A. Ghosh, X. Zhou, and N. Shroff, “Towards achieving sub-linear regret and hard constraint violation in model-free rl,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 1054–1062

  14. [14]

    Robust dynamic programming,

    G. N. Iyengar, “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005

  15. [15]

    Online robust reinforcement learning with model uncertainty,

    Y. Wang and S. Zou, “Online robust reinforcement learning with model uncertainty,” Advances in Neural Information Processing Systems, vol. 34, pp. 7193–7206, 2021

  16. [16]

    The curious price of distributional robustness in reinforcement learning with a generative model,

    L. Shi, G. Li, Y. Wei, Y. Chen, M. Geist, and Y. Chi, “The curious price of distributional robustness in reinforcement learning with a generative model,” Advances in Neural Information Processing Systems, vol. 36, 2024

  17. [17]

    Robust reinforcement learning using offline data,

    K. Panaganti, Z. Xu, D. Kalathil, and M. Ghavamzadeh, “Robust reinforcement learning using offline data,” Advances in Neural Information Processing Systems, vol. 35, pp. 32211–32224, 2022

  18. [18]

    Improved sample complexity bounds for distributionally robust reinforcement learning,

    Z. Xu, K. Panaganti, and D. Kalathil, “Improved sample complexity bounds for distributionally robust reinforcement learning,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2023, pp. 9728–9754

  19. [19]

    Policy gradient in robust mdps with global convergence guarantee,

    Q. Wang, C. P. Ho, and M. Petrik, “Policy gradient in robust mdps with global convergence guarantee,” in International Conference on Machine Learning. PMLR, 2023, pp. 35763–35797

  20. [20]

    Robust reinforcement learning using adversarial populations,

    E. Vinitsky, Y. Du, K. Parvate, K. Jang, P. Abbeel, and A. Bayen, “Robust reinforcement learning using adversarial populations,” arXiv preprint arXiv:2008.01825, 2020

  21. [21]

    Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

    R. H. Russel, M. Benosman, and J. Van Baar, “Robust constrained-mdps: Soft-constrained robust policy optimization under model uncertainty,” arXiv preprint arXiv:2010.04870, 2020

  22. [22]

    Robust constrained reinforcement learning for continuous control with model misspecification,

    D. J. Mankowitz, D. A. Calian, R. Jeong, C. Paduraru, N. Heess, S. Dathathri, M. Riedmiller, and T. Mann, “Robust constrained reinforcement learning for continuous control with model misspecification,” arXiv preprint arXiv:2010.10644, 2020

  23. [23]

    Robust constrained reinforcement learning,

    Y. Wang, F. Miao, and S. Zou, “Robust constrained reinforcement learning,” arXiv preprint arXiv:2209.06866, 2022

  24. [24]

    Iteration complexity for robust cmdp for finite policy space,

    S. Ganguly and A. Ghosh, “Iteration complexity for robust cmdp for finite policy space,” in 2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 2713–2719

  25. [25]

    On kernelized multi-armed bandits,

    S. R. Chowdhury and A. Gopalan, “On kernelized multi-armed bandits,” in International Conference on Machine Learning. PMLR, 2017, pp. 844–853

  26. [26]

    Information-theoretic regret bounds for gaussian process optimization in the bandit setting,

    N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-theoretic regret bounds for gaussian process optimization in the bandit setting,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, May 2012. [Online]. Available: http://dx.doi.org/10.1109/TIT.2011.2182033

  27. [27]

    Calibrated model-based deep reinforcement learning,

    A. Malik, V. Kuleshov, J. Song, D. Nemer, H. Seymour, and S. Ermon, “Calibrated model-based deep reinforcement learning,” 2019. [Online]. Available: https://arxiv.org/abs/1906.08312

  28. [28]

    Distributionally robust bayesian optimization,

    J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause, “Distributionally robust bayesian optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2174–2184

  29. [29]

    A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes,

    H. Wei, X. Liu, and L. Ying, “A provably-efficient model-free algorithm for constrained markov decision processes,” arXiv preprint arXiv:2106.01577, 2021

  30. [30]

    Gaussian process optimization in the bandit setting: No regret and experimental design,

    N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009

  31. [31]

    Our rhc-ucrl lemmas and proofs here,

    “Our rhc-ucrl lemmas and proofs here,” 2026. [Online]. Available: https://github.com/Sourav1429/RHC_UCRL/blob/main/RHC_UCRL.pdf

  32. [32]

    Natural policy gradient primal-dual method for constrained markov decision processes

    D. Ding, K. Zhang, T. Basar, and M. R. Jovanovic, “Natural policy gradient primal-dual method for constrained markov decision processes,” in NeurIPS, 2020

  33. [33]

    Deep reinforcement learning in a handful of trials using probabilistic dynamics models,

    K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” Advances in Neural Information Processing Systems, vol. 31, 2018

  34. [34]

    Rectified robust policy optimization for model-uncertain constrained reinforcement learning without strong duality,

    S. Ma, Z. Chen, Y. Zhou, and H. Huang, “Rectified robust policy optimization for model-uncertain constrained reinforcement learning without strong duality,” arXiv preprint arXiv:2508.17448, 2025

  35. [35]

    Efficient model-based reinforcement learning through optimistic exploration,

    C. Qin, C. Sun, K. Zhang, Z. Wang, Z. Yang, J. Shamma, and T. Başar, “Efficient model-based reinforcement learning through optimistic exploration,” in Advances in Neural Information Processing Systems (NeurIPS) 33, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and L. Callaway, Eds., 2020, pp. 18833–18844. [Online]. Available: https://proceedings.ne...

  36. [36]

    Understanding machine learning: From theory to algorithms

    S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014