Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
Pith reviewed 2026-05-10 13:32 UTC · model grok-4.3
The pith
An algorithm learns policies that remain optimal and safe when state transitions depend on both the agent's actions and an explicit adversarial policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that safety-constrained RL under explicit adversarial dynamics admits an algorithm, Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), that maintains optimism over both agent and adversary policies, separates epistemic from aleatoric uncertainty, and delivers sub-linear regret together with sub-linear constraint violation bounds.
What carries the argument
RHC-UCRL, which maintains optimism over both agent and adversary policies by constructing hallucinated transition models that account for the worst-case exogenous action.
Load-bearing premise
Exogenous factors can be represented as an adversarial policy that the algorithm can optimize against optimistically without needing strong assumptions on how far that policy lies from a known nominal model.
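The "hallucinated" construction behind this premise can be sketched concretely. In the sketch below, `mu` and `sigma` are illustrative stand-ins for a calibrated learned model and its epistemic confidence width, not the paper's actual model class; the learner treats the confidence tube as extra control authority via an auxiliary input `eta`:

```python
import numpy as np

# Illustrative stand-ins for a calibrated model: 'mu' is the learned mean
# dynamics and 'sigma' the epistemic confidence width (in a real run it
# would shrink as data accumulates). Neither is the paper's actual model.
def mu(s, a, a_bar):
    return 0.9 * s + 0.5 * a - 0.3 * a_bar

def sigma(s, a, a_bar):
    return 0.1 * np.ones_like(s)

def hallucinated_step(s, a, a_bar, eta, beta=2.0):
    """One hallucinated transition: the auxiliary control eta in [-1, 1]^n
    selects any next state inside the epistemic confidence tube, letting
    the planner be optimistic about what the true dynamics might do."""
    assert np.all(np.abs(eta) <= 1.0), "eta must stay inside the unit box"
    return mu(s, a, a_bar) + beta * sigma(s, a, a_bar) * eta

s = np.array([1.0])
nominal    = hallucinated_step(s, np.array([0.2]), np.array([0.1]), np.array([0.0]))
optimistic = hallucinated_step(s, np.array([0.2]), np.array([0.1]), np.array([1.0]))
```

Setting `eta = 0` recovers the mean model; optimizing `eta` jointly with the agent's policy while the adversary's action ranges over its own set is what "optimism over both agent and adversary policies" amounts to in this sketch.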
What would settle it
Run RHC-UCRL in a controlled environment where an adversary is explicitly programmed to maximize constraint violations and measure whether cumulative regret and violations stay bounded by a sublinear function of the number of steps.
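That experiment reduces to a simple diagnostic: cumulative violations divided by sqrt(T) should stay bounded if an O(sqrt(T)) bound holds, and grow without bound if violations accumulate linearly. A minimal harness, using synthetic violation traces rather than actual RHC-UCRL output:

```python
import numpy as np

def sublinearity_ratio(per_step_violation):
    """Cumulative violation scaled by sqrt(t): bounded iff the cumulative
    sum grows no faster than O(sqrt(T))."""
    cum = np.cumsum(per_step_violation)
    t = np.arange(1, len(cum) + 1)
    return cum / np.sqrt(t)

T = 10_000
# Per-step violation decaying like 1/sqrt(t) gives O(sqrt(T)) cumulative
# violation; a constant per-step violation gives linear growth.
good = 1.0 / np.sqrt(np.arange(1, T + 1))
bad = np.full(T, 0.05)
```

Under this diagnostic, `sublinearity_ratio(good)` plateaus near 2 while `sublinearity_ratio(bad)` keeps growing like sqrt(T), which is the signature the proposed adversarial stress test would look for.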
Original abstract
Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on exogenous factors outside its control (competing agents, environmental disturbances, or strategic adversaries): formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h$, where $\bar{a}_h$ is the adversary/external action, $a_h$ is the agent's action, and $\omega_h$ is additive noise. Ignoring such factors can yield policies that are optimal in isolation but fail catastrophically in deployment, particularly when safety constraints must be satisfied. Standard constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the strategic interaction between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an adversarial policy $\bar{\pi}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics. We propose Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. RHC-UCRL achieves sub-linear regret and constraint violation guarantees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RHC-UCRL, a model-based algorithm for safety-constrained RL in MDPs where transitions are co-determined by the agent's policy and an explicit adversarial policy representing exogenous factors (s_{h+1} = f(s_h, a_h, bar{a}_h) + omega_h). It maintains optimism over both agent and adversary policies while separating epistemic from aleatoric uncertainty, claims sub-linear regret and constraint violation guarantees, and positions itself as the first work to study this setting without nominal-model divergence assumptions.
Significance. If the dual-optimism construction and uncertainty separation rigorously deliver the stated bounds, the work would meaningfully extend robust RL to explicitly strategic adversaries in safety-critical domains. The explicit adversarial-policy modeling and lack of strong divergence assumptions are potential strengths relative to distributional-robustness baselines.
major comments (1)
- [Abstract] Abstract (central claim): the construction maintains optimism over both agent and adversary policies yet asserts constraint-violation guarantees against a strategic bar{pi}. For the bound to hold under s_{h+1} = f(s_h, a_h, bar{a}_h) + omega_h, optimism on bar{pi} must still induce a sufficiently pessimistic estimate of constraint satisfaction; otherwise residual model error can be exploited by the adversary. No derivation or uncertainty-set construction is shown that resolves this tension while preserving sub-linear violation.
minor comments (1)
- [Abstract] The abstract states 'sub-linear regret and constraint violation guarantees' without indicating the dependence on horizon H, state-action space size, or confidence parameters; explicit rates would clarify the result.
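One standard way to resolve the tension raised in the major comment is to take the optimistic edge of the confidence interval for reward but the pessimistic edge for cost, so exploration never rests on an unverified claim of safety. A schematic sketch with hypothetical interval data (`r_hat`, `c_hat`, `w` are illustrative, not the paper's construction):

```python
# Schematic optimism/pessimism split: optimistic on reward, pessimistic
# on the constraint. r_hat/c_hat are point estimates and w a confidence
# width, all hypothetical illustrations.
def select_policy(policies, r_hat, c_hat, w, budget):
    # Feasibility uses the WORST-case (upper) cost estimate.
    feasible = [p for p in policies if c_hat[p] + w[p] <= budget]
    if not feasible:
        # No policy is provably safe: fall back to the least pessimistic cost.
        return min(policies, key=lambda p: c_hat[p] + w[p])
    # Among provably safe policies, pick the optimistic (upper) reward.
    return max(feasible, key=lambda p: r_hat[p] + w[p])

policies = ["safe", "risky"]
r_hat = {"safe": 0.5, "risky": 0.9}
c_hat = {"safe": 0.2, "risky": 0.7}
w = {"safe": 0.1, "risky": 0.1}
print(select_policy(policies, r_hat, c_hat, w, budget=0.5))  # -> "safe"
```

With a looser budget of 1.0 both policies are provably feasible and the optimistic reward picks "risky"; the question the referee raises is whether the paper's uncertainty sets let this split survive a strategic adversary inside the dynamics.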
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for pinpointing a key tension in how our dual-optimism construction interacts with constraint-violation bounds. We address the concern directly below and indicate where revisions will be made for clarity.
Point-by-point responses
Referee: [Abstract] Abstract (central claim): the construction maintains optimism over both agent and adversary policies yet asserts constraint-violation guarantees against a strategic bar{pi}. For the bound to hold under s_{h+1} = f(s_h, a_h, bar{a}_h) + omega_h, optimism on bar{pi} must still induce a sufficiently pessimistic estimate of constraint satisfaction; otherwise residual model error can be exploited by the adversary. No derivation or uncertainty-set construction is shown that resolves this tension while preserving sub-linear violation.
Authors: We appreciate the referee highlighting this subtlety. In RHC-UCRL the uncertainty sets are constructed separately for the agent and adversary components of the transition function f. Epistemic uncertainty is isolated via Hoeffding-style concentration on the empirical estimates of f, while the aleatoric noise omega_h is handled by explicit variance terms. Optimism for the agent selects policies that maximize an upper confidence bound on reward minus a penalty for constraint violation. For the adversary, the same uncertainty set is used, but the constraint value function takes the minimum (worst-case) realization over admissible bar{a} within the set; this induces the required pessimism for constraint satisfaction even though the adversary policy itself is chosen optimistically for exploration. The resulting high-probability bound on cumulative violation is sub-linear, O(sqrt(T log(1/delta))), and is derived in Section 4.2 (uncertainty-set construction) and Theorem 3 (regret and violation analysis), with the full proof in Appendix C. We agree the abstract is terse on this mechanism and will revise it to state explicitly that adversary optimism is taken inside a pessimistic constraint evaluation.
Revision: partial
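Under the assumptions stated in the response, the selection rule can be written schematically (notation ours, not the paper's), with $\mathcal{F}_t$ the confidence set over dynamics, $V$ and $C$ the reward and constraint value functions, and $\tau$ the cost budget:

```latex
% Optimism on reward, pessimism on the constraint (schematic, notation assumed):
\pi_t \in \arg\max_{\pi}\; \max_{\tilde f \in \mathcal{F}_t} V^{\pi}_{\tilde f}
\quad \text{s.t.} \quad
\max_{\bar\pi}\; \max_{\tilde f \in \mathcal{F}_t} C^{\pi,\bar\pi}_{\tilde f} \;\le\; \tau,
\qquad
\sum_{t=1}^{T} \Big( C^{\pi_t} - \tau \Big)_{+} \;=\; O\!\big(\sqrt{T \log(1/\delta)}\big).
```

The inner double maximum over adversary policy and model is what makes the constraint evaluation pessimistic even while the reward objective remains optimistic.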
Circularity Check
No circularity detected; the claims rest on the proposed algorithm, with no self-referential reductions in the available text.
full rationale
The abstract and problem setup introduce RHC-UCRL as a novel model-based algorithm maintaining optimism over both agent and adversary policies, with sub-linear regret and violation guarantees. No equations, derivations, or fitted parameters are presented that reduce predictions to inputs by construction. The 'first work' claim is a novelty assertion independent of self-citations. No load-bearing steps (self-definitional, fitted-input, or self-citation chains) are identifiable from the provided text, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: exogenous factors can be modeled as an adversarial policy co-determining transitions.
invented entities (1)
- RHC-UCRL algorithm (no independent evidence)
Reference graph
Works this paper leans on
- [1] C. Tang, B. Abbatematteo, J. Hu, R. Chandra, R. Martín-Martín, and P. Stone, "Deep reinforcement learning for robotics: A survey of real-world successes," Annual Review of Control, Robotics, and Autonomous Systems, vol. 8, no. 1, pp. 153–188, 2025.
- [2] F. Stasolla, A. Passaro, E. Curcio, M. Di Gioia, A. Zullo, M. Dragone, and E. Martini, "Combined deep and reinforcement learning with gaming to promote healthcare in neurodevelopmental disorders: A new hypothesis," Frontiers in Human Neuroscience, vol. 19, p. 1557826, 2025.
- [3] K. Pandit, S. Ganguly, A. Banerjee, S. Angizi, and A. Ghosh, "Certifiable safe RLHF: Fixed-penalty constraint optimization for safer language models," arXiv preprint arXiv:2510.03520, 2025.
- [4] A. Dobrovsky, U. M. Borghoff, and M. Hofmann, "Improving adaptive gameplay in serious games through interactive deep reinforcement learning," in Cognitive Infocommunications, Theory and Applications. Springer, 2018, pp. 411–432.
- [5] S. Padakandla, K. Prabuchandran, S. Ganguly, and S. Bhatnagar, "Data efficient safe reinforcement learning," in 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022, pp. 1167–1172.
- [6] T. Kitamura, T. Kozuno, W. Kumagai, K. Hoshino, Y. Hosoe, K. Kasaura, M. Hamaya, P. Parmas, and Y. Matsuo, "Near-optimal policy identification in robust constrained Markov decision processes via epigraph form," arXiv preprint arXiv:2408.16286, 2024.
- [7] S. Ganguly, A. Ghosh, K. Panaganti, and A. Wierman, "Efficient policy optimization in robust constrained MDPs with iteration complexity guarantees," arXiv preprint arXiv:2505.19238, 2025.
- [8] K. Zhang, Z. Yang, and T. Başar, "Multi-agent reinforcement learning: A selective overview of theories and algorithms," Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
- [9] S. Curi, I. Bogunovic, and A. Krause, "Combining pessimism with optimism for robust and efficient model-based deep reinforcement learning," in International Conference on Machine Learning. PMLR, 2021, pp. 2254–2264.
- [10] S. Curi, F. Berkenkamp, and A. Krause, "Efficient model-based reinforcement learning through optimistic policy search and planning," Advances in Neural Information Processing Systems, vol. 33, pp. 14156–14170, 2020.
- [11] A. Ghosh, X. Zhou, and N. Shroff, "Provably efficient model-free constrained RL with linear function approximation," Advances in Neural Information Processing Systems, vol. 35, pp. 13303–13315, 2022.
- [12] D. Ding, C.-Y. Wei, K. Zhang, and A. Ribeiro, "Last-iterate convergent policy gradient primal-dual methods for constrained MDPs," arXiv preprint arXiv:2306.11700, 2023.
- [13] A. Ghosh, X. Zhou, and N. Shroff, "Towards achieving sub-linear regret and hard constraint violation in model-free RL," in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 1054–1062.
- [14] G. N. Iyengar, "Robust dynamic programming," Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.
- [15] Y. Wang and S. Zou, "Online robust reinforcement learning with model uncertainty," Advances in Neural Information Processing Systems, vol. 34, pp. 7193–7206, 2021.
- [16] L. Shi, G. Li, Y. Wei, Y. Chen, M. Geist, and Y. Chi, "The curious price of distributional robustness in reinforcement learning with a generative model," Advances in Neural Information Processing Systems, vol. 36, 2024.
- [17] K. Panaganti, Z. Xu, D. Kalathil, and M. Ghavamzadeh, "Robust reinforcement learning using offline data," Advances in Neural Information Processing Systems, vol. 35, pp. 32211–32224, 2022.
- [18] Z. Xu, K. Panaganti, and D. Kalathil, "Improved sample complexity bounds for distributionally robust reinforcement learning," in International Conference on Artificial Intelligence and Statistics. PMLR, 2023, pp. 9728–9754.
- [19] Q. Wang, C. P. Ho, and M. Petrik, "Policy gradient in robust MDPs with global convergence guarantee," in International Conference on Machine Learning. PMLR, 2023, pp. 35763–35797.
- [20] E. Vinitsky, Y. Du, K. Parvate, K. Jang, P. Abbeel, and A. Bayen, "Robust reinforcement learning using adversarial populations," arXiv preprint arXiv:2008.01825, 2020.
- [21] R. H. Russel, M. Benosman, and J. Van Baar, "Robust constrained-MDPs: Soft-constrained robust policy optimization under model uncertainty," arXiv preprint arXiv:2010.04870, 2020.
- [22] D. J. Mankowitz, D. A. Calian, R. Jeong, C. Paduraru, N. Heess, S. Dathathri, M. Riedmiller, and T. Mann, "Robust constrained reinforcement learning for continuous control with model misspecification," arXiv preprint arXiv:2010.10644, 2020.
- [23] Y. Wang, F. Miao, and S. Zou, "Robust constrained reinforcement learning," arXiv preprint arXiv:2209.06866, 2022.
- [24] S. Ganguly and A. Ghosh, "Iteration complexity for robust CMDP for finite policy space," in 2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 2713–2719.
- [25] S. R. Chowdhury and A. Gopalan, "On kernelized multi-armed bandits," in International Conference on Machine Learning. PMLR, 2017, pp. 844–853.
- [26] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, "Information-theoretic regret bounds for Gaussian process optimization in the bandit setting," IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, May 2012. [Online]. Available: http://dx.doi.org/10.1109/TIT.2011.2182033
- [27] A. Malik, V. Kuleshov, J. Song, D. Nemer, H. Seymour, and S. Ermon, "Calibrated model-based deep reinforcement learning," 2019. [Online]. Available: https://arxiv.org/abs/1906.08312
- [28] J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause, "Distributionally robust Bayesian optimization," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2174–2184.
- [29] H. Wei, X. Liu, and L. Ying, "A provably-efficient model-free algorithm for constrained Markov decision processes," arXiv preprint arXiv:2106.01577, 2021.
- [30] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," arXiv preprint arXiv:0912.3995, 2009.
- [31] "Our RHC-UCRL lemmas and proofs here," 2026. [Online]. Available: https://github.com/Sourav1429/RHC_UCRL/blob/main/RHC_UCRL.pdf
- [32] D. Ding, K. Zhang, T. Başar, and M. R. Jovanovic, "Natural policy gradient primal-dual method for constrained Markov decision processes," in NeurIPS, 2020.
- [33] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [34] S. Ma, Z. Chen, Y. Zhou, and H. Huang, "Rectified robust policy optimization for model-uncertain constrained reinforcement learning without strong duality," arXiv preprint arXiv:2508.17448, 2025.
- [35] C. Qin, C. Sun, K. Zhang, Z. Wang, Z. Yang, J. Shamma, and T. Başar, "Efficient model-based reinforcement learning through optimistic exploration," in Advances in Neural Information Processing Systems (NeurIPS) 33, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and L. Callaway, Eds., 2020, pp. 18833–18844. [Online]. Available: https://proceedings.ne...
- [36] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Appendix
Lemma 1 (adapted from Corollary 1 in [35]): Under Assumptions 1 and 3, for every $s, s' \in \mathcal{S}$ the following inequality holds:
$$\|f(s, \pi(s), \bar{\pi}(s)) - f(s', \pi(s'), \bar{\pi}(s'))\| \le L_f \sqrt{1 + L_\pi^2 + L_{\bar{\pi}}^2}\,\|s - s'\|_2. \tag{14}$$
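The inequality in Lemma 1 follows by composing Lipschitz constants; a sketch under the lemma's implicit assumptions ($f$ is $L_f$-Lipschitz in its joint argument, $\pi$ is $L_\pi$-Lipschitz, $\bar{\pi}$ is $L_{\bar{\pi}}$-Lipschitz):

```latex
\begin{align*}
\|f(s,\pi(s),\bar{\pi}(s)) - f(s',\pi(s'),\bar{\pi}(s'))\|
  &\le L_f \big\| (s,\pi(s),\bar{\pi}(s)) - (s',\pi(s'),\bar{\pi}(s')) \big\|_2 \\
  &=   L_f \sqrt{\|s-s'\|_2^2 + \|\pi(s)-\pi(s')\|_2^2 + \|\bar{\pi}(s)-\bar{\pi}(s')\|_2^2} \\
  &\le L_f \sqrt{\|s-s'\|_2^2 + L_\pi^2\|s-s'\|_2^2 + L_{\bar{\pi}}^2\|s-s'\|_2^2}
   =   L_f \sqrt{1 + L_\pi^2 + L_{\bar{\pi}}^2}\;\|s-s'\|_2 .
\end{align*}
```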