pith. machine review for the scientific record. sign in

arxiv: 2604.04795 · v1 · submitted 2026-04-06 · 🧮 math.OC

Recognition: 2 theorem links

· Lean Theorem

Sample Complexity for Markov Decision Processes and Stochastic Optimal Control with Static Risk Measures

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:26 UTC · model grok-4.3

classification 🧮 math.OC
keywords Markov decision processesstochastic optimal controlstatic risk measuresstate augmentationsample complexitydynamic programmingconditional value-at-riskdistributionally robust optimization
0
0 comments X

The pith

A simple state augmentation reduces risk-sensitive Markov decision processes and stochastic optimal control to standard dynamic programming problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that for a class of static risk measures applied to total cost, an elementary augmentation of the state space incorporates the risk information while preserving the Markov property. This lets dynamic programming equations be written directly on the augmented space for both Markov decision processes and stochastic optimal control. Sample complexity bounds then follow for finding near-optimal solutions in finite-horizon and infinite-horizon settings. A reader would care because risk measures such as conditional value-at-risk are used to guard against uncertainty in planning and control, yet they usually obstruct standard dynamic programming; the augmentation makes those tools available again. The same reduction is applied to distributionally robust problems generated by phi-divergences.

Core claim

We present an elementary state augmentation method for a class of static risk measure applied to the total cost for both Markov decision processes and stochastic optimal control, such that dynamic programming equations can be derived on the augmented space. Through this we discuss the sample complexities of these two problems for both finite-horizon and infinite-horizon settings. We demonstrate the application of the proposed approach through studying distributionally robust functional generated by phi-divergences including conditional value-at-risk.

What carries the argument

Elementary state augmentation that folds the static risk measure into an expanded state while keeping the process Markovian.

If this is right

  • Dynamic programming becomes directly applicable to total-cost risk-sensitive problems in MDPs.
  • Finite-horizon sample complexity bounds can be stated for approximating optimal policies under the risk measure.
  • Infinite-horizon sample complexity bounds follow for the same class of problems.
  • The same bounds and DP reduction hold for stochastic optimal control.
  • Distributionally robust problems generated by phi-divergences, including CVaR, inherit the DP and sample-complexity results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The augmentation may allow reuse of existing model-based or model-free solvers for risk-averse planning without redesigning the core algorithm.
  • If the class of admissible risk measures is larger than shown, the same technique could apply to other coherent risk functionals used in finance and robotics.
  • Sample-complexity results suggest that data requirements grow with the risk parameter in a controllable way, which could guide experiment design for risk-sensitive learning.

Load-bearing premise

The static risk measure must admit an elementary state augmentation that preserves the Markov property and allows dynamic programming without extra assumptions on the risk functional or the underlying dynamics.

What would settle it

Finding a static risk measure in the considered class for which no elementary augmentation exists that keeps the augmented process Markovian and yields dynamic programming equations.

read the original abstract

We present an elementary state augmentation method for a class of static risk measure applied to the total cost for both Markov decision processes and stochastic optimal control, such that dynamic programming equations can be derived on the augmented space. Through this we discuss the sample complexities of these two problems for both finite-horizon and infinite-horizon settings. We demonstrate the application of the proposed approach through studying distributionally robust functional generated by $\phi$-divergences including conditional value-at-risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an elementary state augmentation method for a class of static risk measures (including those generated by φ-divergences and CVaR) applied to total cost in MDPs and stochastic optimal control. This allows derivation of dynamic programming equations on the augmented space. The authors then discuss sample complexities for finite- and infinite-horizon settings and demonstrate the approach on distributionally robust functionals.

Significance. If the augmentation is shown to be Markov-preserving with rigorous error control, the framework could unify static risk measures with DP and sample-complexity analysis in risk-sensitive control, a useful contribution given the growing interest in robust MDPs. The explicit treatment of φ-divergences is a strength, but overall significance depends on whether the continuous-state issues raised by the augmentation are resolved with concrete bounds.

major comments (2)
  1. [§3 and §5] §3 (state augmentation for CVaR): the auxiliary variable tracking tail mass or threshold is continuous-valued, yet the sample-complexity claims in §5 rely on finite-state PAC bounds (e.g., covering |S|·|A|) without supplying discretization-error or covering-number arguments. This directly affects the central sample-complexity discussion.
  2. [§4] §4 (infinite-horizon case): the derivation of the DP equation on the augmented space assumes the risk functional admits an elementary augmentation that preserves the Markov property without further structural assumptions on the dynamics or the risk measure; the paper does not verify this for general φ-divergences beyond the CVaR example.
minor comments (2)
  1. [§2] Notation for the augmented state (e.g., the auxiliary component) is introduced without a clear table or diagram contrasting the original and augmented processes.
  2. [Introduction] Several references to prior work on risk-sensitive MDPs (e.g., on CVaR augmentation) are cited but not compared in detail to the proposed elementary method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major comment below and indicate the revisions we plan to make to the manuscript.

read point-by-point responses
  1. Referee: [§3 and §5] §3 (state augmentation for CVaR): the auxiliary variable tracking tail mass or threshold is continuous-valued, yet the sample-complexity claims in §5 rely on finite-state PAC bounds (e.g., covering |S|·|A|) without supplying discretization-error or covering-number arguments. This directly affects the central sample-complexity discussion.

    Authors: We agree that the auxiliary variable for CVaR augmentation is continuous-valued and that §5 applies finite-state PAC bounds directly to the augmented process. In the revised manuscript we will add a dedicated subsection in §5 that supplies covering-number arguments for the continuous component together with explicit discretization-error bounds. These bounds will be derived from standard Lipschitz assumptions on the risk functional and will show that the additional sample complexity incurred by discretization is only logarithmic in the desired accuracy, thereby justifying the stated rates for the augmented state space. revision: yes

  2. Referee: [§4] §4 (infinite-horizon case): the derivation of the DP equation on the augmented space assumes the risk functional admits an elementary augmentation that preserves the Markov property without further structural assumptions on the dynamics or the risk measure; the paper does not verify this for general φ-divergences beyond the CVaR example.

    Authors: The augmentation construction is intended to be elementary and Markov-preserving for the entire class of static risk measures generated by φ-divergences. While the CVaR case is worked out in detail, the same pattern—augmenting the state with the dual variables or threshold parameters of the divergence ball—extends directly to general φ-divergences. In the revision we will expand §4 to state this construction explicitly for an arbitrary φ-divergence, confirming that the Markov property is preserved under precisely the same assumptions on the transition kernel that are already used for the finite-horizon case, without requiring extra structural conditions on the dynamics. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained

full rationale

The abstract outlines a state augmentation technique for static risk measures (including phi-divergences and CVaR) that preserves the Markov property and yields dynamic programming equations on the augmented space, followed by sample-complexity analysis for finite- and infinite-horizon MDPs and SOC. No equations, fitted parameters, self-citations, or uniqueness theorems are visible that would reduce any claimed prediction or result to an input by construction. The approach is presented as introducing an elementary augmentation without additional structural assumptions, making the central claim independent of the inputs it analyzes. This qualifies as a normal non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5362 in / 988 out tokens · 34377 ms · 2026-05-10T19:26:44.246529+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences

    math.OC 2026-04 unverdicted novelty 7.0

    For φ-divergences with superlinear growth, sample average approximation achieves P-independent sample complexity for worst-case expectation estimation depending only on φ's growth, ball radius and precision, with opti...

Reference graph

Works this paper leans on

35 extracted references · 6 canonical work pages · cited by 1 Pith paper

  1. [1]

    Regret bounds for risk-sensitive reinforcement learning.Advances in Neural Information Processing Systems, 35:36259–36269, 2022

    Osbert Bastani, Jason Yecheng Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning.Advances in Neural Information Processing Systems, 35:36259–36269, 2022. 19

  2. [2]

    Markov decision processes with average-value-at-risk criteria.Math- ematical Methods of Operations Research, 74(3):361–379, 2011

    Nicole B¨ auerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria.Math- ematical Methods of Operations Research, 74(3):361–379, 2011

  3. [3]

    More risk-sensitive markov decision processes.Mathematics of Operations Research, 39(1):105–120, 2014

    Nicole B¨ auerle and Ulrich Rieder. More risk-sensitive markov decision processes.Mathematics of Operations Research, 39(1):105–120, 2014

  4. [4]

    Data-driven stochastic programming using phi-divergences

    G¨ uzin Bayraksan and David K Love. Data-driven stochastic programming using phi-divergences. In The operations research revolution, pages 1–19. Informs, 2015

  5. [5]

    Decomposition and partitioning methods for multistage stochastic linear programs

    John R Birge. Decomposition and partitioning methods for multistage stochastic linear programs. Operations research, 33(5):989–1007, 1985

  6. [6]

    Quantifying distributional model risk via optimal transport

    Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019

  7. [7]

    Algorithms for cvar optimization in mdps.Advances in neural information processing systems, 27, 2014

    Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for cvar optimization in mdps.Advances in neural information processing systems, 27, 2014

  8. [8]

    Risk-sensitive and robust decision-making: a cvar optimization approach.Advances in neural information processing systems, 28, 2015

    Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach.Advances in neural information processing systems, 28, 2015

  9. [9]

    Sequential optimization of cvar.arXiv e-prints, pages arXiv–2211, 2022

    Rui Ding and Eugene A Feinberg. Sequential optimization of cvar.arXiv e-prints, pages arXiv–2211, 2022

  10. [10]

    Risk-sensitive reinforce- ment learning: Near-optimal risk-sample tradeoff in regret.Advances in Neural Information Processing Systems, 33:22384–22395, 2020

    Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforce- ment learning: Near-optimal risk-sample tradeoff in regret.Advances in Neural Information Processing Systems, 33:22384–22395, 2020

  11. [11]

    Distributionally robust stochastic optimization with wasserstein distance

    Rui Gao and Anton Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2):603–655, 2023

  12. [12]

    On dynamic programming decompositions of static risk measures in markov decision processes.Advances in Neural Information Processing Systems, 36:51734–51757, 2023

    Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. On dynamic programming decompositions of static risk measures in markov decision processes.Advances in Neural Information Processing Systems, 36:51734–51757, 2023

  13. [13]

    Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

    Garud N Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

  14. [14]

    Complexity of stochastic dual dynamic programming.Mathematical Programming, 191(2):717–754, 2022

    Guanghui Lan. Complexity of stochastic dual dynamic programming.Mathematical Programming, 191(2):717–754, 2022

  15. [15]

    First-Order Policy Optimization for Robust Markov Decision Process

    Yan Li, Guanghui Lan, and Tuo Zhao. First-order policy optimization for robust markov decision process.arXiv preprint arXiv:2209.10579, 2022

  16. [16]

    Rectangularity and duality of distributionally robust markov decision processes: Y

    Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust markov decision processes: Y. li, a. shapiro.Mathematical Programming, pages 1–42, 2025

  17. [17]

    Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations.Mathematical Program- ming, 171(1):115–166, 2018

    Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations.Mathematical Program- ming, 171(1):115–166, 2018

  18. [18]

    Reward redistribu- tion for cvar mdps using a bellman operator on l-infinity.arXiv preprint arXiv:2602.03778, 2026

    Aneri Muni, Vincent Taboga, Esther Derman, Pierre-Luc Bacon, and Erick Delage. Reward redistribu- tion for cvar mdps using a bellman operator on l-infinity.arXiv preprint arXiv:2602.03778, 2026

  19. [19]

    Risk-sensitive reward-free reinforcement learning with cvar

    Xinyi Ni, Guanlin Liu, and Lifeng Lai. Risk-sensitive reward-free reinforcement learning with cvar. In Forty-first International Conference on Machine Learning, 2024

  20. [20]

    Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

    Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005. 20

  21. [21]

    Sample complexity of robust reinforcement learning with a generative model

    Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582–

  22. [22]

    Multi-stage stochastic optimization applied to energy planning.Mathematical programming, 52(1):359–375, 1991

    Mario VF Pereira and Leontina MVG Pinto. Multi-stage stochastic optimization applied to energy planning.Mathematical programming, 52(1):359–375, 1991

  23. [23]

    John Wiley & Sons, 2014

    Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  24. [24]

    Risk-averse dynamic programming for markov decision processes.Mathematical programming, 125(2):235–261, 2010

    Andrzej Ruszczy´ nski. Risk-averse dynamic programming for markov decision processes.Mathematical programming, 125(2):235–261, 2010

  25. [25]

    On a time consistency concept in risk averse multistage stochastic programming

    Alexander Shapiro. On a time consistency concept in risk averse multistage stochastic programming. Operations Research Letters, 37(3):143–147, 2009

  26. [26]

    Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

    Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

  27. [27]

    SIAM, 2021

    Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski.Lectures on stochastic programming: modeling and theory. SIAM, 2021

  28. [28]

    Risk-averse formulations of stochastic optimal control and markov decision processes.arXiv preprint arXiv:2505.16651, 2025

    Alexander Shapiro and Yan Li. Risk-averse formulations of stochastic optimal control and markov decision processes.arXiv preprint arXiv:2505.16651, 2025

  29. [29]

    Near-optimal time and sample complexities for solving discounted markov decision process with a generative model.arXiv preprint arXiv:1806.01492, 2018

    Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model.arXiv preprint arXiv:1806.01492, 2018

  30. [30]

    Near-minimax-optimal risk-sensitive reinforcement learn- ing with cvar

    Kaiwen Wang, Nathan Kallus, and Wen Sun. Near-minimax-optimal risk-sensitive reinforcement learn- ing with cvar. InInternational Conference on Machine Learning, pages 35864–35907. PMLR, 2023

  31. [31]

    InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15503–15514

    Kaiwen Wang, Dawen Liang, Nathan Kallus, and Wen Sun. A reductions approach to risk-sensitive reinforcement learning with optimized certainty equivalents.arXiv preprint arXiv:2403.06323, 2024

  32. [32]

    Policy gradient in robust mdps with global convergence guarantee

    Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust mdps with global convergence guarantee. InInternational conference on machine learning, pages 35763–35797. PMLR, 2023

  33. [33]

    A finite sample complexity bound for distributionally robust q-learning

    Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust q-learning. InInternational conference on artificial intelligence and statistics, pages 3370–3398. PMLR, 2023

  34. [34]

    Policy gradient method for robust reinforcement learning

    Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. InInternational conference on machine learning, pages 23484–23526. PMLR, 2022

  35. [35]

    Risk-sensitive markov decision process and learning under general utility functions.arXiv preprint arXiv:2311.13589, 2023

    Zhengqi Wu and Renyuan Xu. Risk-sensitive markov decision process and learning under general utility functions.arXiv preprint arXiv:2311.13589, 2023. A Supplementary Proofs Lemma A.1.Let (Ω,F, P) be a measurable space, andX: Ω→[0, B] beF-measurable whereB >0. Consider R(X) = sup ζ∈D Z Ω X(ω)ζ(ω)dP(ω),s.t. ζ∈D, Z Ω ϕ(ζ(w))dP(w)≤τ.(A.1) whereDdenotes the se...