Recognition: 2 theorem links
· Lean Theorem · Sample Complexity for Markov Decision Processes and Stochastic Optimal Control with Static Risk Measures
Pith reviewed 2026-05-10 19:26 UTC · model grok-4.3
The pith
A simple state augmentation reduces risk-sensitive Markov decision processes and stochastic optimal control to standard dynamic programming problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present an elementary state augmentation method for a class of static risk measures applied to the total cost in both Markov decision processes and stochastic optimal control, such that dynamic programming equations can be derived on the augmented space. Through this we discuss the sample complexities of these two problems in both finite-horizon and infinite-horizon settings. We demonstrate the proposed approach by studying distributionally robust functionals generated by φ-divergences, including conditional value-at-risk.
What carries the argument
Elementary state augmentation that folds the static risk measure into an expanded state while keeping the process Markovian.
If this is right
- Dynamic programming becomes directly applicable to total-cost risk-sensitive problems in MDPs.
- Finite-horizon sample complexity bounds can be stated for approximating optimal policies under the risk measure.
- Infinite-horizon sample complexity bounds follow for the same class of problems.
- The same bounds and DP reduction hold for stochastic optimal control.
- Distributionally robust problems generated by phi-divergences, including CVaR, inherit the DP and sample-complexity results.
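The reduction described above can be sketched end-to-end for CVaR of total cost. The following is a minimal illustrative sketch, not the paper's construction: it augments the state with the accumulated cost and uses the Rockafellar–Uryasev form of CVaR; all MDP sizes, the cost table, and the grids are made-up toy values.

```python
# Minimal illustrative sketch (not the paper's construction): CVaR of total cost
# in a finite-horizon MDP via state augmentation. Using the Rockafellar-Uryasev
# form CVaR_a(C) = min_tau { tau + E[(C - tau)^+] / a }, each fixed tau reduces
# the risk-sensitive problem to standard DP over (state, accumulated cost).
import itertools
import numpy as np

n_s, n_a, H, alpha = 2, 2, 3, 0.25
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))       # P[s, a] = next-state dist
c = rng.integers(0, 3, size=(n_s, n_a)).astype(float)  # integer per-step costs

def expected_tail_cost(tau):
    """Optimal E[(C - tau)^+] / alpha via DP on the augmented state (s, x)."""
    xs = np.arange(0, 2 * H + 1)                       # reachable accumulated costs
    V = np.broadcast_to(np.maximum(xs - tau, 0.0) / alpha, (n_s, xs.size)).copy()
    for _ in range(H):                                 # backward induction
        Vn = np.empty_like(V)
        for s, i in itertools.product(range(n_s), range(xs.size)):
            nxt = [min(i + int(c[s, a]), xs.size - 1) for a in range(n_a)]
            Vn[s, i] = min(P[s, a] @ V[:, j] for a, j in enumerate(nxt))
        V = Vn
    return V[0, 0]                                     # start at s = 0, x = 0

# Outer minimization over tau; min over (policy, tau) jointly gives optimal CVaR.
cvar = min(tau + expected_tail_cost(tau) for tau in np.arange(0.0, 2 * H + 0.5, 0.5))
```

Interchanging the two minimizations is valid because both are infima, which is what lets the augmented DP and the threshold search be run independently.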
Where Pith is reading between the lines
- The augmentation may allow reuse of existing model-based or model-free solvers for risk-averse planning without redesigning the core algorithm.
- If the class of admissible risk measures is larger than shown, the same technique could apply to other coherent risk functionals used in finance and robotics.
- Sample-complexity results suggest that data requirements grow with the risk parameter in a controllable way, which could guide experiment design for risk-sensitive learning.
Load-bearing premise
The static risk measure must admit an elementary state augmentation that preserves the Markov property and allows dynamic programming without extra assumptions on the risk functional or the underlying dynamics.
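For CVaR in particular, this premise is known to be satisfiable through the Rockafellar–Uryasev variational form; the sketch below uses generic notation that may differ from the paper's.

```latex
% CVaR of the total cost C = \sum_{t} c_t(s_t, a_t) admits the variational form
\mathrm{CVaR}_\alpha(C) \;=\; \min_{\tau \in \mathbb{R}}
  \Big\{ \tau + \tfrac{1}{\alpha}\,\mathbb{E}\big[(C - \tau)_+\big] \Big\}.
% For fixed \tau the inner problem is an expected-utility MDP once the state is
% augmented with the running cost x_t = \sum_{u < t} c_u(s_u, a_u):
%   x_{t+1} = x_t + c_t(s_t, a_t), \qquad \text{terminal cost } (x_H - \tau)_+ / \alpha,
% so the augmented process (s_t, x_t) remains Markov and standard DP applies.
```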
What would settle it
Finding a static risk measure in the considered class for which no elementary augmentation exists that keeps the augmented process Markovian and yields dynamic programming equations.
read the original abstract
We present an elementary state augmentation method for a class of static risk measure applied to the total cost for both Markov decision processes and stochastic optimal control, such that dynamic programming equations can be derived on the augmented space. Through this we discuss the sample complexities of these two problems for both finite-horizon and infinite-horizon settings. We demonstrate the application of the proposed approach through studying distributionally robust functional generated by $\phi$-divergences including conditional value-at-risk.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an elementary state augmentation method for a class of static risk measures (including those generated by φ-divergences and CVaR) applied to total cost in MDPs and stochastic optimal control. This allows derivation of dynamic programming equations on the augmented space. The authors then discuss sample complexities for finite- and infinite-horizon settings and demonstrate the approach on distributionally robust functionals.
Significance. If the augmentation is shown to be Markov-preserving with rigorous error control, the framework could unify static risk measures with DP and sample-complexity analysis in risk-sensitive control, a useful contribution given the growing interest in robust MDPs. The explicit treatment of φ-divergences is a strength, but overall significance depends on whether the continuous-state issues raised by the augmentation are resolved with concrete bounds.
major comments (2)
- [§3 and §5] §3 (state augmentation for CVaR): the auxiliary variable tracking tail mass or threshold is continuous-valued, yet the sample-complexity claims in §5 rely on finite-state PAC bounds (e.g., covering |S|·|A|) without supplying discretization-error or covering-number arguments. This directly affects the central sample-complexity discussion.
- [§4] §4 (infinite-horizon case): the derivation of the DP equation on the augmented space assumes the risk functional admits an elementary augmentation that preserves the Markov property without further structural assumptions on the dynamics or the risk measure; the paper does not verify this for general φ-divergences beyond the CVaR example.
minor comments (2)
- [§2] Notation for the augmented state (e.g., the auxiliary component) is introduced without a clear table or diagram contrasting the original and augmented processes.
- [Introduction] Several references to prior work on risk-sensitive MDPs (e.g., on CVaR augmentation) are cited but not compared in detail to the proposed elementary method.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We address each major comment below and indicate the revisions we plan to make to the manuscript.
read point-by-point responses
-
Referee: [§3 and §5] §3 (state augmentation for CVaR): the auxiliary variable tracking tail mass or threshold is continuous-valued, yet the sample-complexity claims in §5 rely on finite-state PAC bounds (e.g., covering |S|·|A|) without supplying discretization-error or covering-number arguments. This directly affects the central sample-complexity discussion.
Authors: We agree that the auxiliary variable for CVaR augmentation is continuous-valued and that §5 applies finite-state PAC bounds directly to the augmented process. In the revised manuscript we will add a dedicated subsection in §5 that supplies covering-number arguments for the continuous component together with explicit discretization-error bounds. These bounds will be derived from standard Lipschitz assumptions on the risk functional and will show that the additional sample complexity incurred by discretization is only logarithmic in the desired accuracy, thereby justifying the stated rates for the augmented state space. revision: yes
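The promised logarithmic overhead can be sanity-checked with a toy covering count. The function below is a hypothetical illustration, not the manuscript's bound; the bound B, Lipschitz constant L, and all sizes are assumed.

```python
# Hypothetical illustration (not the manuscript's bound): discretizing the bounded
# continuous auxiliary variable of the CVaR augmentation on an eps/L-grid. With the
# auxiliary component in [0, B] and values L-Lipschitz in it, the grid has
# ceil(B * L / eps) points, so the union-bound/covering term in a finite-state PAC
# bound grows only logarithmically as eps shrinks.
import math

def augmented_log_cover(n_states, n_actions, B, L, eps):
    """log-cardinality of the discretized augmented state-action space."""
    grid = math.ceil(B * L / eps)  # grid fine enough for <= eps value error
    return math.log(n_states * n_actions * grid)

base = augmented_log_cover(50, 5, B=10.0, L=1.0, eps=1e-2)   # 1000-point grid
finer = augmented_log_cover(50, 5, B=10.0, L=1.0, eps=1e-4)  # 100x finer grid
# A 100x finer grid adds only log(100) ~ 4.6 to the log-covering term, i.e. the
# discretization inflates the sample complexity additively, not multiplicatively.
```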
-
Referee: [§4] §4 (infinite-horizon case): the derivation of the DP equation on the augmented space assumes the risk functional admits an elementary augmentation that preserves the Markov property without further structural assumptions on the dynamics or the risk measure; the paper does not verify this for general φ-divergences beyond the CVaR example.
Authors: The augmentation construction is intended to be elementary and Markov-preserving for the entire class of static risk measures generated by φ-divergences. While the CVaR case is worked out in detail, the same pattern—augmenting the state with the dual variables or threshold parameters of the divergence ball—extends directly to general φ-divergences. In the revision we will expand §4 to state this construction explicitly for an arbitrary φ-divergence, confirming that the Markov property is preserved under precisely the same assumptions on the transition kernel that are already used for the finite-horizon case, without requiring extra structural conditions on the dynamics. revision: yes
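The extension sketched in this response matches the standard convex dual of a φ-divergence ball (cf. [26]); the formula below is a generic statement, not necessarily the manuscript's notation, with conjugate $\phi^*(s) = \sup_{t \ge 0}\{st - \phi(t)\}$.

```latex
% Worst-case expectation over the phi-divergence ball of radius \tau around P:
\sup_{Q \,:\, D_\phi(Q \| P) \le \tau} \mathbb{E}_Q[X]
  \;=\; \inf_{\lambda > 0,\; \mu \in \mathbb{R}}
  \Big\{ \lambda \tau + \mu
        + \lambda\, \mathbb{E}_P\Big[ \phi^*\Big( \tfrac{X - \mu}{\lambda} \Big) \Big] \Big\}.
% For fixed (\lambda, \mu) the objective is an expectation of a convex function of
% the total cost X, so the same accumulated-cost augmentation applies, with
% (\lambda, \mu) in the role that the threshold \tau plays for CVaR.
```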
Circularity Check
No circularity detected; derivation self-contained
full rationale
The abstract outlines a state augmentation technique for static risk measures (including phi-divergences and CVaR) that preserves the Markov property and yields dynamic programming equations on the augmented space, followed by sample-complexity analysis for finite- and infinite-horizon MDPs and SOC. No equations, fitted parameters, self-citations, or uniqueness theorems are visible that would reduce any claimed prediction or result to an input by construction. The approach is presented as introducing an elementary augmentation without additional structural assumptions, making the central claim independent of the inputs it analyzes. This qualifies as a normal non-finding of circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
"We present an elementary state augmentation method for a class of static risk measure applied to the total cost ... such that dynamic programming equations can be derived on the augmented space."
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
"Definition 2.1 (Finite-horizon augmented MDP) ... $\tilde{c}_\theta(s_t, x_t, a_t) = f_\theta(x_t + c_t(s_t, a_t)) - f_\theta(x_t)$"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences: for φ-divergences with superlinear growth, sample average approximation achieves P-independent sample complexity for worst-case expectation estimation depending only on φ's growth, ball radius and precision, with opti...
Reference graph
Works this paper leans on
- [1] Osbert Bastani, Jason Yecheng Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 35:36259–36269, 2022.
- [2] Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
- [3] Nicole Bäuerle and Ulrich Rieder. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014.
- [4] Güzin Bayraksan and David K Love. Data-driven stochastic programming using phi-divergences. In The Operations Research Revolution, pages 1–19. INFORMS, 2015.
- [5] John R Birge. Decomposition and partitioning methods for multistage stochastic linear programs. Operations Research, 33(5):989–1007, 1985.
- [6] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.
- [7] Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. Advances in Neural Information Processing Systems, 27, 2014.
- [8] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. Advances in Neural Information Processing Systems, 28, 2015.
- [9] Rui Ding and Eugene A Feinberg. Sequential optimization of CVaR. arXiv e-prints, 2022.
- [10] Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforcement learning: near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33:22384–22395, 2020.
- [11] Rui Gao and Anton Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655, 2023.
- [12] Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. On dynamic programming decompositions of static risk measures in Markov decision processes. Advances in Neural Information Processing Systems, 36:51734–51757, 2023.
- [13] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
- [14] Guanghui Lan. Complexity of stochastic dual dynamic programming. Mathematical Programming, 191(2):717–754, 2022.
- [15] Yan Li, Guanghui Lan, and Tuo Zhao. First-order policy optimization for robust Markov decision process. arXiv preprint arXiv:2209.10579, 2022.
- [16] Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust Markov decision processes. Mathematical Programming, pages 1–42, 2025.
- [17] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.
- [18] Aneri Muni, Vincent Taboga, Esther Derman, Pierre-Luc Bacon, and Erick Delage. Reward redistribution for CVaR MDPs using a Bellman operator on L-infinity. arXiv preprint arXiv:2602.03778, 2026.
- [19] Xinyi Ni, Guanlin Liu, and Lifeng Lai. Risk-sensitive reward-free reinforcement learning with CVaR. In Forty-first International Conference on Machine Learning, 2024.
- [20] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
- [21] Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–
- [22] Mario VF Pereira and Leontina MVG Pinto. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1):359–375, 1991.
- [23] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- [24] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
- [25] Alexander Shapiro. On a time consistency concept in risk averse multistage stochastic programming. Operations Research Letters, 37(3):143–147, 2009.
- [26] Alexander Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
- [27] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2021.
- [28] Alexander Shapiro and Yan Li. Risk-averse formulations of stochastic optimal control and Markov decision processes. arXiv preprint arXiv:2505.16651, 2025.
- [29] Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. arXiv preprint arXiv:1806.01492, 2018.
- [30] Kaiwen Wang, Nathan Kallus, and Wen Sun. Near-minimax-optimal risk-sensitive reinforcement learning with CVaR. In International Conference on Machine Learning, pages 35864–35907. PMLR, 2023.
- [31] Kaiwen Wang, Dawen Liang, Nathan Kallus, and Wen Sun. A reductions approach to risk-sensitive reinforcement learning with optimized certainty equivalents. arXiv preprint arXiv:2403.06323, 2024.
- [32] Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust MDPs with global convergence guarantee. In International Conference on Machine Learning, pages 35763–35797. PMLR, 2023.
- [33] Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3370–3398. PMLR, 2023.
- [34] Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. In International Conference on Machine Learning, pages 23484–23526. PMLR, 2022.
- [35] Zhengqi Wu and Renyuan Xu. Risk-sensitive Markov decision process and learning under general utility functions. arXiv preprint arXiv:2311.13589, 2023.

Appendix fragment (the paper's Lemma A.1 from "A Supplementary Proofs", truncated): Let $(\Omega, \mathcal{F}, P)$ be a measurable space, and let $X: \Omega \to [0, B]$ be $\mathcal{F}$-measurable, where $B > 0$. Consider
$$\mathcal{R}(X) \;=\; \sup_{\zeta \in \mathfrak{D}} \int_\Omega X(\omega)\,\zeta(\omega)\,dP(\omega), \qquad \text{s.t. } \int_\Omega \phi(\zeta(\omega))\,dP(\omega) \le \tau, \tag{A.1}$$
where $\mathfrak{D}$ denotes the se...