arxiv: 2604.04795 · v1 · submitted 2026-04-06 · 🧮 math.OC

Recognition: 2 theorem links

· Lean Theorem

Sample Complexity for Markov Decision Processes and Stochastic Optimal Control with Static Risk Measures

Cristian Ch\'avez , Yan Li

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:26 UTC · model grok-4.3

classification 🧮 math.OC

keywords Markov decision processesstochastic optimal controlstatic risk measuresstate augmentationsample complexitydynamic programmingconditional value-at-riskdistributionally robust optimization

0 comments

The pith

A simple state augmentation reduces risk-sensitive Markov decision processes and stochastic optimal control to standard dynamic programming problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that for a class of static risk measures applied to total cost, an elementary augmentation of the state space incorporates the risk information while preserving the Markov property. This lets dynamic programming equations be written directly on the augmented space for both Markov decision processes and stochastic optimal control. Sample complexity bounds then follow for finding near-optimal solutions in finite-horizon and infinite-horizon settings. A reader would care because risk measures such as conditional value-at-risk are used to guard against uncertainty in planning and control, yet they usually obstruct standard dynamic programming; the augmentation makes those tools available again. The same reduction is applied to distributionally robust problems generated by phi-divergences.

Core claim

We present an elementary state augmentation method for a class of static risk measure applied to the total cost for both Markov decision processes and stochastic optimal control, such that dynamic programming equations can be derived on the augmented space. Through this we discuss the sample complexities of these two problems for both finite-horizon and infinite-horizon settings. We demonstrate the application of the proposed approach through studying distributionally robust functional generated by phi-divergences including conditional value-at-risk.

What carries the argument

Elementary state augmentation that folds the static risk measure into an expanded state while keeping the process Markovian.

If this is right

Dynamic programming becomes directly applicable to total-cost risk-sensitive problems in MDPs.
Finite-horizon sample complexity bounds can be stated for approximating optimal policies under the risk measure.
Infinite-horizon sample complexity bounds follow for the same class of problems.
The same bounds and DP reduction hold for stochastic optimal control.
Distributionally robust problems generated by phi-divergences, including CVaR, inherit the DP and sample-complexity results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The augmentation may allow reuse of existing model-based or model-free solvers for risk-averse planning without redesigning the core algorithm.
If the class of admissible risk measures is larger than shown, the same technique could apply to other coherent risk functionals used in finance and robotics.
Sample-complexity results suggest that data requirements grow with the risk parameter in a controllable way, which could guide experiment design for risk-sensitive learning.

Load-bearing premise

The static risk measure must admit an elementary state augmentation that preserves the Markov property and allows dynamic programming without extra assumptions on the risk functional or the underlying dynamics.

What would settle it

Finding a static risk measure in the considered class for which no elementary augmentation exists that keeps the augmented process Markovian and yields dynamic programming equations.

read the original abstract

We present an elementary state augmentation method for a class of static risk measure applied to the total cost for both Markov decision processes and stochastic optimal control, such that dynamic programming equations can be derived on the augmented space. Through this we discuss the sample complexities of these two problems for both finite-horizon and infinite-horizon settings. We demonstrate the application of the proposed approach through studying distributionally robust functional generated by $\phi$-divergences including conditional value-at-risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

State augmentation turns static risk measures into DP problems on a bigger space, but sample complexity claims for CVaR run into trouble because the auxiliary state is continuous.

read the letter

The main point is that the authors show how to augment the state space so that static risk measures applied to total cost become amenable to standard dynamic programming in both MDPs and stochastic optimal control. They then use that to talk about sample complexity in finite-horizon and infinite-horizon settings, with an example on phi-divergence risks including CVaR. They do a decent job laying out the augmentation method in an elementary way. It preserves the Markov property and lets them derive the DP equations on the new space without heavy extra assumptions. That part looks useful for turning risk-sensitive problems into more standard ones. The weaker part is the sample complexity discussion. For CVaR the auxiliary state is typically continuous, like a running threshold or tail probability. Once you augment with a continuous variable the state space stops being finite. Most sample complexity results for MDPs rely on finite states to get bounds in terms of |S| and |A|. Without discretization, covering arguments, or approximation error controls, those bounds don't transfer. The abstract does not mention any such steps, so if the paper does not supply them the complexity claims are not fully supported. This paper is aimed at researchers in risk-aware control and robust RL who want finite-sample results for distributionally robust problems. A reader working on similar augmentation techniques or sample bounds for risk measures would get something out of it, provided the proofs address the continuous-state issue. I would recommend sending it for peer review. The core construction is straightforward and the topic is timely, even if the sample complexity section needs strengthening or clarification on how they handle the continuous augmentation.

Referee Report

2 major / 2 minor

Summary. The paper presents an elementary state augmentation method for a class of static risk measures (including those generated by φ-divergences and CVaR) applied to total cost in MDPs and stochastic optimal control. This allows derivation of dynamic programming equations on the augmented space. The authors then discuss sample complexities for finite- and infinite-horizon settings and demonstrate the approach on distributionally robust functionals.

Significance. If the augmentation is shown to be Markov-preserving with rigorous error control, the framework could unify static risk measures with DP and sample-complexity analysis in risk-sensitive control, a useful contribution given the growing interest in robust MDPs. The explicit treatment of φ-divergences is a strength, but overall significance depends on whether the continuous-state issues raised by the augmentation are resolved with concrete bounds.

major comments (2)

[§3 and §5] §3 (state augmentation for CVaR): the auxiliary variable tracking tail mass or threshold is continuous-valued, yet the sample-complexity claims in §5 rely on finite-state PAC bounds (e.g., covering |S|·|A|) without supplying discretization-error or covering-number arguments. This directly affects the central sample-complexity discussion.
[§4] §4 (infinite-horizon case): the derivation of the DP equation on the augmented space assumes the risk functional admits an elementary augmentation that preserves the Markov property without further structural assumptions on the dynamics or the risk measure; the paper does not verify this for general φ-divergences beyond the CVaR example.

minor comments (2)

[§2] Notation for the augmented state (e.g., the auxiliary component) is introduced without a clear table or diagram contrasting the original and augmented processes.
[Introduction] Several references to prior work on risk-sensitive MDPs (e.g., on CVaR augmentation) are cited but not compared in detail to the proposed elementary method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major comment below and indicate the revisions we plan to make to the manuscript.

read point-by-point responses

Referee: [§3 and §5] §3 (state augmentation for CVaR): the auxiliary variable tracking tail mass or threshold is continuous-valued, yet the sample-complexity claims in §5 rely on finite-state PAC bounds (e.g., covering |S|·|A|) without supplying discretization-error or covering-number arguments. This directly affects the central sample-complexity discussion.

Authors: We agree that the auxiliary variable for CVaR augmentation is continuous-valued and that §5 applies finite-state PAC bounds directly to the augmented process. In the revised manuscript we will add a dedicated subsection in §5 that supplies covering-number arguments for the continuous component together with explicit discretization-error bounds. These bounds will be derived from standard Lipschitz assumptions on the risk functional and will show that the additional sample complexity incurred by discretization is only logarithmic in the desired accuracy, thereby justifying the stated rates for the augmented state space. revision: yes
Referee: [§4] §4 (infinite-horizon case): the derivation of the DP equation on the augmented space assumes the risk functional admits an elementary augmentation that preserves the Markov property without further structural assumptions on the dynamics or the risk measure; the paper does not verify this for general φ-divergences beyond the CVaR example.

Authors: The augmentation construction is intended to be elementary and Markov-preserving for the entire class of static risk measures generated by φ-divergences. While the CVaR case is worked out in detail, the same pattern—augmenting the state with the dual variables or threshold parameters of the divergence ball—extends directly to general φ-divergences. In the revision we will expand §4 to state this construction explicitly for an arbitrary φ-divergence, confirming that the Markov property is preserved under precisely the same assumptions on the transition kernel that are already used for the finite-horizon case, without requiring extra structural conditions on the dynamics. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained

full rationale

The abstract outlines a state augmentation technique for static risk measures (including phi-divergences and CVaR) that preserves the Markov property and yields dynamic programming equations on the augmented space, followed by sample-complexity analysis for finite- and infinite-horizon MDPs and SOC. No equations, fitted parameters, self-citations, or uniqueness theorems are visible that would reduce any claimed prediction or result to an input by construction. The approach is presented as introducing an elementary augmentation without additional structural assumptions, making the central claim independent of the inputs it analyzes. This qualifies as a normal non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5362 in / 988 out tokens · 34377 ms · 2026-05-10T19:26:44.246529+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present an elementary state augmentation method for a class of static risk measure applied to the total cost ... such that dynamic programming equations can be derived on the augmented space.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Definition 2.1 (Finite-horizon augmented MDP) ... ect_θ(st, xt, at) = f_θ(xt + c_t(st, at)) − f_θ(xt)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences
math.OC 2026-04 unverdicted novelty 7.0

For φ-divergences with superlinear growth, sample average approximation achieves P-independent sample complexity for worst-case expectation estimation depending only on φ's growth, ball radius and precision, with opti...

Reference graph

Works this paper leans on

35 extracted references · 6 canonical work pages · cited by 1 Pith paper

[1]

Regret bounds for risk-sensitive reinforcement learning.Advances in Neural Information Processing Systems, 35:36259–36269, 2022

Osbert Bastani, Jason Yecheng Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning.Advances in Neural Information Processing Systems, 35:36259–36269, 2022. 19

2022
[2]

Markov decision processes with average-value-at-risk criteria.Math- ematical Methods of Operations Research, 74(3):361–379, 2011

Nicole B¨ auerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria.Math- ematical Methods of Operations Research, 74(3):361–379, 2011

2011
[3]

More risk-sensitive markov decision processes.Mathematics of Operations Research, 39(1):105–120, 2014

Nicole B¨ auerle and Ulrich Rieder. More risk-sensitive markov decision processes.Mathematics of Operations Research, 39(1):105–120, 2014

2014
[4]

Data-driven stochastic programming using phi-divergences

G¨ uzin Bayraksan and David K Love. Data-driven stochastic programming using phi-divergences. In The operations research revolution, pages 1–19. Informs, 2015

2015
[5]

Decomposition and partitioning methods for multistage stochastic linear programs

John R Birge. Decomposition and partitioning methods for multistage stochastic linear programs. Operations research, 33(5):989–1007, 1985

1985
[6]

Quantifying distributional model risk via optimal transport

Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019

2019
[7]

Algorithms for cvar optimization in mdps.Advances in neural information processing systems, 27, 2014

Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for cvar optimization in mdps.Advances in neural information processing systems, 27, 2014

2014
[8]

Risk-sensitive and robust decision-making: a cvar optimization approach.Advances in neural information processing systems, 28, 2015

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach.Advances in neural information processing systems, 28, 2015

2015
[9]

Sequential optimization of cvar.arXiv e-prints, pages arXiv–2211, 2022

Rui Ding and Eugene A Feinberg. Sequential optimization of cvar.arXiv e-prints, pages arXiv–2211, 2022

2022
[10]

Risk-sensitive reinforce- ment learning: Near-optimal risk-sample tradeoff in regret.Advances in Neural Information Processing Systems, 33:22384–22395, 2020

Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforce- ment learning: Near-optimal risk-sample tradeoff in regret.Advances in Neural Information Processing Systems, 33:22384–22395, 2020

2020
[11]

Distributionally robust stochastic optimization with wasserstein distance

Rui Gao and Anton Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2):603–655, 2023

2023
[12]

On dynamic programming decompositions of static risk measures in markov decision processes.Advances in Neural Information Processing Systems, 36:51734–51757, 2023

Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. On dynamic programming decompositions of static risk measures in markov decision processes.Advances in Neural Information Processing Systems, 36:51734–51757, 2023

2023
[13]

Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

Garud N Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

2005
[14]

Complexity of stochastic dual dynamic programming.Mathematical Programming, 191(2):717–754, 2022

Guanghui Lan. Complexity of stochastic dual dynamic programming.Mathematical Programming, 191(2):717–754, 2022

2022
[15]

First-Order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, and Tuo Zhao. First-order policy optimization for robust markov decision process.arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022
[16]

Rectangularity and duality of distributionally robust markov decision processes: Y

Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust markov decision processes: Y. li, a. shapiro.Mathematical Programming, pages 1–42, 2025

2025
[17]

Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations.Mathematical Program- ming, 171(1):115–166, 2018

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations.Mathematical Program- ming, 171(1):115–166, 2018

2018
[18]

Reward redistribu- tion for cvar mdps using a bellman operator on l-infinity.arXiv preprint arXiv:2602.03778, 2026

Aneri Muni, Vincent Taboga, Esther Derman, Pierre-Luc Bacon, and Erick Delage. Reward redistribu- tion for cvar mdps using a bellman operator on l-infinity.arXiv preprint arXiv:2602.03778, 2026

work page arXiv 2026
[19]

Risk-sensitive reward-free reinforcement learning with cvar

Xinyi Ni, Guanlin Liu, and Lifeng Lai. Risk-sensitive reward-free reinforcement learning with cvar. In Forty-first International Conference on Machine Learning, 2024

2024
[20]

Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005. 20

2005
[21]

Sample complexity of robust reinforcement learning with a generative model

Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582–
[22]

Multi-stage stochastic optimization applied to energy planning.Mathematical programming, 52(1):359–375, 1991

Mario VF Pereira and Leontina MVG Pinto. Multi-stage stochastic optimization applied to energy planning.Mathematical programming, 52(1):359–375, 1991

1991
[23]

John Wiley & Sons, 2014

Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

2014
[24]

Risk-averse dynamic programming for markov decision processes.Mathematical programming, 125(2):235–261, 2010

Andrzej Ruszczy´ nski. Risk-averse dynamic programming for markov decision processes.Mathematical programming, 125(2):235–261, 2010

2010
[25]

On a time consistency concept in risk averse multistage stochastic programming

Alexander Shapiro. On a time consistency concept in risk averse multistage stochastic programming. Operations Research Letters, 37(3):143–147, 2009

2009
[26]

Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

2017
[27]

SIAM, 2021

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski.Lectures on stochastic programming: modeling and theory. SIAM, 2021

2021
[28]

Risk-averse formulations of stochastic optimal control and markov decision processes.arXiv preprint arXiv:2505.16651, 2025

Alexander Shapiro and Yan Li. Risk-averse formulations of stochastic optimal control and markov decision processes.arXiv preprint arXiv:2505.16651, 2025

work page arXiv 2025
[29]

Near-optimal time and sample complexities for solving discounted markov decision process with a generative model.arXiv preprint arXiv:1806.01492, 2018

Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model.arXiv preprint arXiv:1806.01492, 2018

work page arXiv 2018
[30]

Near-minimax-optimal risk-sensitive reinforcement learn- ing with cvar

Kaiwen Wang, Nathan Kallus, and Wen Sun. Near-minimax-optimal risk-sensitive reinforcement learn- ing with cvar. InInternational Conference on Machine Learning, pages 35864–35907. PMLR, 2023

2023
[31]

InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15503–15514

Kaiwen Wang, Dawen Liang, Nathan Kallus, and Wen Sun. A reductions approach to risk-sensitive reinforcement learning with optimized certainty equivalents.arXiv preprint arXiv:2403.06323, 2024

work page arXiv 2024
[32]

Policy gradient in robust mdps with global convergence guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust mdps with global convergence guarantee. InInternational conference on machine learning, pages 35763–35797. PMLR, 2023

2023
[33]

A finite sample complexity bound for distributionally robust q-learning

Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust q-learning. InInternational conference on artificial intelligence and statistics, pages 3370–3398. PMLR, 2023

2023
[34]

Policy gradient method for robust reinforcement learning

Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. InInternational conference on machine learning, pages 23484–23526. PMLR, 2022

2022
[35]

Risk-sensitive markov decision process and learning under general utility functions.arXiv preprint arXiv:2311.13589, 2023

Zhengqi Wu and Renyuan Xu. Risk-sensitive markov decision process and learning under general utility functions.arXiv preprint arXiv:2311.13589, 2023. A Supplementary Proofs Lemma A.1.Let (Ω,F, P) be a measurable space, andX: Ω→[0, B] beF-measurable whereB >0. Consider R(X) = sup ζ∈D Z Ω X(ω)ζ(ω)dP(ω),s.t. ζ∈D, Z Ω ϕ(ζ(w))dP(w)≤τ.(A.1) whereDdenotes the se...

work page arXiv 2023