
arxiv: 1907.05231 · v1 · pith:6M4BSOJGnew · submitted 2019-07-09 · 💻 cs.LG · cs.AI · stat.ML

Variance-Based Risk Estimations in Markov Processes via Transformation with State Lumping

Pith reviewed 2026-05-25 00:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords variance-based risk · Markov decision processes · state augmentation · state lumping · isotopic states · mean-variance risk · exponential utility risk · risk-sensitive reinforcement learning

The pith

State augmentation and isotopic lumping enable exact estimation of mean-variance and exponential utility risks in MDPs with stochastic rewards and randomized policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that two law-invariant risks can be computed exactly in MDPs that include both stochastic transition-based rewards and randomized policies. It does so by first applying a state-augmentation transformation that restructures the problem into a form where the risks become estimable, then defining isotopic states whose lumping reduces the enlarged space while leaving the risk values unchanged. A sympathetic reader would care because most practical risk-sensitive reinforcement learning settings involve precisely these features, yet standard variance calculations break down without such a transformation and reduction step. The numerical experiments confirm that the combined procedure works for the chosen risks and that naive simplifications introduce measurable errors.
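
For orientation, the two risks are commonly written as below for a random return G. The symbols G, \lambda and \beta are notation assumed here, and the paper's own sign and scaling conventions may differ. The last line is the standard small-\beta expansion, which is one way to see how an exponential utility risk relates to variance.

```latex
\begin{align}
\rho_{\mathrm{MV}}(G) &= \mathbb{E}[G] - \lambda\,\mathrm{Var}(G), \qquad \lambda > 0,\\
\rho_{\mathrm{EU}}(G) &= -\frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta G}\right], \qquad \beta > 0,\\
\rho_{\mathrm{EU}}(G) &= \mathbb{E}[G] - \frac{\beta}{2}\,\mathrm{Var}(G) + O(\beta^{2}).
\end{align}
```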

Core claim

With the aid of the state-augmentation transformation (SAT), the two risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To relieve the enlarged state space, a novel definition of isotopic states is proposed for state lumping, considering the special structure of the transformed transition probability.

What carries the argument

State-augmentation transformation (SAT) that converts the MDP into an equivalent process where risks are estimable, combined with isotopic-state lumping that exploits the structure of the transformed transition probabilities to shrink the state space without altering the risk values.
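
The idea of augmenting the state so that the reward becomes a deterministic function of the state can be sketched as follows. This is one plausible construction under stated assumptions, not necessarily the authors' SAT: each augmented state is a hypothetical tuple (s, a, s2, r) recording one full transition, and the names `augment`, `reward_dist`, `P_aug` and `r_aug`, as well as all numbers, are made up for illustration.

```python
import numpy as np
from itertools import product

def augment(P, pi, reward_dist):
    """Build a Markov chain whose reward is a deterministic function of the state.

    P[s, a, s2]             : transition probabilities of the original MDP
    pi[s, a]                : randomized policy
    reward_dist[(s, a, s2)] : list of (reward_value, probability) pairs
    """
    n_states, n_actions, _ = P.shape
    aug_states, weights = [], []
    for s, a, s2 in product(range(n_states), range(n_actions), range(n_states)):
        for r, q in reward_dist[(s, a, s2)]:
            w = pi[s, a] * P[s, a, s2] * q  # probability of (a, s2, r) given s
            if w > 0:
                aug_states.append((s, a, s2, r))
                weights.append(w)
    n = len(aug_states)
    P_aug = np.zeros((n, n))
    for i, (_, _, landed, _) in enumerate(aug_states):
        for j, (start, _, _, _) in enumerate(aug_states):
            if start == landed:  # next transition must start where this one ended
                P_aug[i, j] = weights[j]
    r_aug = np.array([z[3] for z in aug_states])  # reward attached to each state
    return aug_states, P_aug, r_aug

# Illustrative two-state, two-action MDP (numbers are invented).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])
pi = np.array([[0.6, 0.4], [0.5, 0.5]])
reward_dist = {(s, a, s2): [(1.0, 0.5), (-1.0, 0.5)]
               for s, a, s2 in product(range(2), range(2), range(2))}

aug_states, P_aug, r_aug = augment(P, pi, reward_dist)
print(len(aug_states), P_aug.sum(axis=1))
```

In this sketch, augmented states that end in the same original state have identical outgoing rows in `P_aug`, which illustrates the kind of structure that makes a later lumping step plausible.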

If this is right

  • Both mean-variance risk and exponential utility risk become estimable under stochastic transition rewards and randomized policies.
  • The state-space growth caused by augmentation is offset by lumping without introducing approximation error into the risk estimates.
  • A naive simplification of the reward distribution produces observable errors that the SAT-plus-lumping procedure avoids.
  • Numerical examples illustrate the validity of the procedure for both risks.
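
The kind of error that a naive reward simplification introduces can be seen in a small sketch. Assumptions: the reward at every step is +1 or -1 with equal probability, the discount factor `gamma` is 0.9, and "naive" here means replacing the stochastic reward by its expectation. The helper `return_moments` is a hypothetical name; it uses the standard first- and second-moment linear systems for a discounted Markov reward chain with a state-based reward.

```python
import numpy as np

def return_moments(P, r, gamma, mu):
    """Mean and variance of the discounted return sum_t gamma^t r(X_t), X_0 ~ mu."""
    I = np.eye(len(r))
    v = np.linalg.solve(I - gamma * P, r)  # per-state mean
    # Per-state second moment: m2 = r^2 + 2*gamma*r*(P v) + gamma^2 * P m2
    m2 = np.linalg.solve(I - gamma**2 * P, r**2 + 2 * gamma * r * (P @ v))
    mean = mu @ v
    return mean, mu @ m2 - mean**2

gamma = 0.9

# Full model: the reward value is part of the state (two states: r = +1 and r = -1).
P_full = np.full((2, 2), 0.5)
r_full = np.array([1.0, -1.0])
mu_full = np.array([0.5, 0.5])
mean_full, var_full = return_moments(P_full, r_full, gamma, mu_full)

# Naive model: the stochastic reward is replaced by its expectation (zero).
P_naive = np.array([[1.0]])
r_naive = np.array([0.0])
mu_naive = np.array([1.0])
mean_naive, var_naive = return_moments(P_naive, r_naive, gamma, mu_naive)

print(mean_full, var_full)    # variance equals 1 / (1 - gamma**2) here
print(mean_naive, var_naive)  # same mean, but the variance collapses to zero
```

Both models agree on the mean, so any risk that depends on the variance is misestimated by the naive model even though the expected return is correct.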

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transformation-plus-lumping pattern may apply to other law-invariant risk measures beyond the two examined.
  • If similar isotopic structure can be identified after augmentation, the approach could scale risk estimation to larger MDPs than direct methods allow.
  • The technique offers a model-reduction route that keeps exact risk semantics rather than approximating them.

Load-bearing premise

The transformed transition probabilities have a structure that permits isotopic states to be identified and lumped while exactly preserving the original risk values.

What would settle it

Compute the mean-variance or exponential utility risk on the augmented chain before and after applying the proposed isotopic lumping; any nonzero difference in the risk values would show that the lumping step does not preserve exactness.
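
A before-and-after check of this kind can be sketched as below. The lumping rule used here is classical ordinary lumpability (equal block-to-block transition sums and equal rewards within a block), standing in for the paper's isotopic-state definition, which is not reproduced on this page. The chain, the partition `blocks`, the weight `lam`, and the helper names are all invented for illustration, and only the mean-variance risk is checked.

```python
import numpy as np

def return_moments(P, r, gamma, mu):
    """Mean and variance of the discounted return sum_t gamma^t r(X_t), X_0 ~ mu."""
    I = np.eye(len(r))
    v = np.linalg.solve(I - gamma * P, r)
    m2 = np.linalg.solve(I - gamma**2 * P, r**2 + 2 * gamma * r * (P @ v))
    mean = mu @ v
    return mean, mu @ m2 - mean**2

def lump(P, r, mu, blocks):
    """Lump a chain over a partition, refusing partitions that are not lumpable."""
    k = len(blocks)
    P_l, r_l, mu_l = np.zeros((k, k)), np.zeros(k), np.zeros(k)
    for b, B in enumerate(blocks):
        # Probability of moving from each state in B into each block.
        rows = np.array([[P[i, C].sum() for C in blocks] for i in B])
        if not (np.allclose(rows, rows[0]) and np.allclose(r[B], r[B[0]])):
            raise ValueError(f"block {B} is not lumpable")
        P_l[b], r_l[b], mu_l[b] = rows[0], r[B[0]], mu[B].sum()
    return P_l, r_l, mu_l

gamma, lam = 0.9, 0.5
P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5],
              [0.4, 0.5, 0.1]])
r = np.array([1.0, -2.0, -2.0])
mu = np.array([0.5, 0.2, 0.3])
blocks = [[0], [1, 2]]  # states 1 and 2 are merged

mean_full, var_full = return_moments(P, r, gamma, mu)
P_l, r_l, mu_l = lump(P, r, mu, blocks)
mean_lumped, var_lumped = return_moments(P_l, r_l, gamma, mu_l)

risk_full = mean_full - lam * var_full
risk_lumped = mean_lumped - lam * var_lumped
print(risk_full - risk_lumped)  # should vanish up to floating-point error
```

A difference that stays at floating-point noise is consistent with exact preservation for this example; a clearly nonzero difference on a partition the rule accepts would be a counterexample.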

Figures

Figures reproduced from arXiv: 1907.05231 by Jia Yuan Yu, Shuai Ma.

Figure 1: A toy example with two states and two actions. The …
Figure 3: The transformed Markov process with a deterministic …
Figure 4: The comparison among the empirical mean-variance …
Figure 6: The comparison among the empirical exponential …

Original abstract

Variance plays a crucial role in risk-sensitive reinforcement learning, and most risk measures can be analyzed via variance. In this paper, we consider two law-invariant risks as examples: mean-variance risk and exponential utility risk. With the aid of the state-augmentation transformation (SAT), we show that, the two risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To relieve the enlarged state space, a novel definition of isotopic states is proposed for state lumping, considering the special structure of the transformed transition probability. In the numerical experiment, we illustrate state lumping in the SAT, errors from a naive reward simplification, and the validity of the SAT for the two risk estimations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a state-augmentation transformation (SAT) enables exact estimation of mean-variance and exponential-utility risks in MDPs with stochastic transition-based rewards and randomized policies; a novel isotopic-state definition then permits state lumping that preserves these exact risk values by exploiting the structure of the transformed transition probabilities, with validity illustrated in a numerical experiment.

Significance. If the isotopic lumping is proven to preserve the exact nonlinear risk values without approximation, the combination of SAT and lumping would supply a practical, non-approximate method for computing variance-based risks in MDPs whose state space would otherwise explode, advancing risk-sensitive RL for stochastic rewards and randomized policies.

major comments (2)
  1. [Abstract] Abstract (central claim on exact estimation via lumping): the manuscript asserts that isotopic lumping preserves exact mean-variance (quadratic) and exponential-utility (strictly convex) values, yet provides no proof that reward averaging under lumping commutes with these nonlinear functionals for arbitrary randomized policies and stochastic rewards. The special structure of the SAT transition probabilities is invoked but not shown to guarantee equality of the risk measures post-lumping.
  2. [Numerical experiment] Numerical experiment (abstract): the only empirical support is described as 'illustrat[ing] ... the validity of the SAT,' but the abstract supplies no error metrics, baseline comparisons, quantitative tables, or derivation steps. This leaves the load-bearing validation of both SAT and lumping uninspectable.
minor comments (1)
  1. [Abstract] Abstract: the clause 'we show that, the two risks' contains an extraneous comma after 'that'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address the major comments point by point below.

  1. Referee: [Abstract] Abstract (central claim on exact estimation via lumping): the manuscript asserts that isotopic lumping preserves exact mean-variance (quadratic) and exponential-utility (strictly convex) values, yet provides no proof that reward averaging under lumping commutes with these nonlinear functionals for arbitrary randomized policies and stochastic rewards. The special structure of the SAT transition probabilities is invoked but not shown to guarantee equality of the risk measures post-lumping.

    Authors: The commutation property follows from the isotopic state definition and the structure of the SAT transition kernel, as derived in Section 3 of the manuscript. However, we agree that an explicit statement of how averaging commutes with the nonlinear functionals (for both risk measures and randomized policies) would strengthen the presentation. We will add a dedicated remark or short proof sketch in the revision.

    revision: yes

  2. Referee: [Numerical experiment] Numerical experiment (abstract): the only empirical support is described as 'illustrat[ing] ... the validity of the SAT,' but the abstract supplies no error metrics, baseline comparisons, quantitative tables, or derivation steps. This leaves the load-bearing validation of both SAT and lumping uninspectable.

    Authors: The full experimental results, including error metrics and comparisons to naive reward averaging, appear in Section 4. We agree the abstract is overly terse on this point and will revise it to include a brief quantitative summary of the observed errors and validity checks.

    revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external SAT plus independent lumping definition


The paper's central construction begins with an external state-augmentation transformation (SAT) and then introduces a novel definition of isotopic states based on the transformed transition probabilities. No equation or definition in the abstract or description reduces the claimed exact preservation of mean-variance or exponential-utility risks to a quantity already fitted or defined inside the same paper; the lumping rule is presented as a new structural property rather than a self-referential fit. No self-citation chain is invoked to justify uniqueness or to smuggle an ansatz, and the numerical experiments are described as validation rather than as the source of the risk values themselves. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or sections from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5652 in / 998 out tokens · 37303 ms · 2026-05-25T00:15:09.690454+00:00 · methodology

