Variance-Based Risk Estimations in Markov Processes via Transformation with State Lumping

Jia Yuan Yu; Shuai Ma

arXiv: 1907.05231 · v1 · pith:6M4BSOJGnew · submitted 2019-07-09 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 00:15 UTC · model grok-4.3

classification cs.LG · cs.AI · stat.ML

keywords variance-based risk · Markov decision processes · state augmentation · state lumping · isotopic states · mean-variance risk · exponential utility risk · risk-sensitive reinforcement learning

The pith

State augmentation and isotopic lumping enable exact estimation of mean-variance and exponential utility risks in MDPs with stochastic rewards and randomized policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that two law-invariant risks can be computed exactly in MDPs that include both stochastic transition-based rewards and randomized policies. It does so by first applying a state-augmentation transformation that restructures the problem into a form where the risks become estimable, then defining isotopic states whose lumping reduces the enlarged space while leaving the risk values unchanged. A sympathetic reader would care because most practical risk-sensitive reinforcement learning settings involve precisely these features, yet standard variance calculations break down without such a transformation and reduction step. The numerical experiments confirm that the combined procedure works for the chosen risks and that naive simplifications introduce measurable errors.

Core claim

With the aid of the state-augmentation transformation (SAT), the two risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To relieve the enlarged state space, a novel definition of isotopic states is proposed for state lumping, considering the special structure of the transformed transition probability.

What carries the argument

State-augmentation transformation (SAT) that converts the MDP into an equivalent process where risks are estimable, combined with isotopic-state lumping that exploits the structure of the transformed transition probabilities to shrink the state space without altering the risk values.
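One way to picture the augmentation step is the sketch below. It is an illustration under stated assumptions, not the paper's construction: the randomized policy is assumed to be already folded into a single transition table `P`, the transition-based reward is assumed to have finite support, and the names `augment`, `reward_dist`, `P_aug`, `r_aug`, and `mu_aug` are hypothetical. Each augmented state is a triple (previous state, current state, reward just paid), so the reward becomes a deterministic function of the augmented state.

```python
def augment(P, reward_dist, mu):
    """Build an augmented chain whose states are (x, y, v) triples.

    P[x][y]             : transition probability of the policy-induced chain
    reward_dist[(x, y)] : dict mapping reward value v -> probability
    mu[x]               : initial distribution over original states
    (All three inputs are illustrative assumptions, not the paper's notation.)
    """
    states = [
        (x, y, v)
        for x in P
        for y, p_xy in P[x].items() if p_xy > 0
        for v, q in reward_dist[(x, y)].items() if q > 0
    ]
    # After the transition x -> y has paid v, the next move depends only on y.
    P_aug = {
        (x, y, v): {
            (y, z, w): p_yz * q
            for z, p_yz in P[y].items() if p_yz > 0
            for w, q in reward_dist[(y, z)].items() if q > 0
        }
        for (x, y, v) in states
    }
    # The reward is now a deterministic, state-based function.
    r_aug = {s: s[2] for s in states}
    # The initial law covers the first transition and its reward draw.
    mu_aug = {
        (x, y, v): mu.get(x, 0.0) * P[x][y] * reward_dist[(x, y)][v]
        for (x, y, v) in states
    }
    return states, P_aug, r_aug, mu_aug


# Toy two-state chain with one stochastic transition reward.
P = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}
reward_dist = {
    (0, 0): {0.0: 1.0},
    (0, 1): {1.0: 0.5, 3.0: 0.5},
    (1, 0): {2.0: 1.0},
}
mu = {0: 1.0}

states, P_aug, r_aug, mu_aug = augment(P, reward_dist, mu)
first_mean = sum(mu_aug[s] * r_aug[s] for s in states)
first_second_moment = sum(mu_aug[s] * r_aug[s] ** 2 for s in states)
```

In this toy case the augmented states `(0, 1, 1.0)` and `(0, 1, 3.0)` have identical outgoing rows. That is one kind of redundancy a lumping step could exploit, though whether it matches the paper's isotopic-state definition is not something this summary confirms.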

If this is right

Both mean-variance risk and exponential utility risk become estimable under stochastic transition rewards and randomized policies.
The state-space growth caused by augmentation is offset by lumping without introducing approximation error into the risk estimates.
A naive simplification of the reward distribution produces observable errors that the SAT-plus-lumping procedure avoids.
Numerical examples illustrate the validity of the procedure for both risks.
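For orientation, the two risks are commonly written as below. These are textbook forms, not formulas taken from the paper: the return $G$, the trade-off weight $\lambda$, and the risk parameter $\beta$ are assumed symbols, and the paper's sign conventions may differ. The third line is the standard cumulant expansion, which shows how the exponential utility reduces to a mean-plus-variance expression for small $\beta$.

```latex
\begin{align*}
\rho_{\mathrm{MV}}(G) &= \mathbb{E}[G] - \lambda\,\mathrm{Var}(G), \qquad \lambda > 0,\\
\rho_{\mathrm{EU}}(G) &= \frac{1}{\beta}\log \mathbb{E}\!\left[e^{\beta G}\right], \qquad \beta \neq 0,\\
\rho_{\mathrm{EU}}(G) &= \mathbb{E}[G] + \frac{\beta}{2}\,\mathrm{Var}(G) + O(\beta^{2}) \quad \text{as } \beta \to 0.
\end{align*}
```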

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transformation-plus-lumping pattern may apply to other law-invariant risk measures beyond the two examined.
If similar isotopic structure can be identified after augmentation, the approach could scale risk estimation to larger MDPs than direct methods allow.
The technique offers a model-reduction route that keeps exact risk semantics rather than approximating them.

Load-bearing premise

The transformed transition probabilities have a structure that permits isotopic states to be identified and lumped while exactly preserving the original risk values.

What would settle it

Compute the mean-variance or exponential utility risk on the augmented chain before and after applying the proposed isotopic lumping; any nonzero difference in the risk values would show that the lumping step does not preserve exactness.
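A before-and-after check of this kind could look like the sketch below. It uses the classical strong-lumpability criterion of Kemeny and Snell rather than the paper's isotopic-state definition, which this summary does not spell out, and it assumes that states in the same block carry the same reward. The helper names (`is_lumpable`, `lump`, `return_law`) and the toy chain are hypothetical.

```python
import math


def is_lumpable(P, partition, tol=1e-12):
    """Strong lumpability: all states in a block send equal mass to each block."""
    for block in partition:
        for target in partition:
            masses = [sum(P[s].get(t, 0.0) for t in target) for s in block]
            if max(masses) - min(masses) > tol:
                return False
    return True


def lump(P, partition):
    """Quotient chain on block indices, using each block's first state as representative."""
    return {
        i: {j: sum(P[block[0]].get(t, 0.0) for t in target)
            for j, target in enumerate(partition)}
        for i, block in enumerate(partition)
    }


def return_law(P, r, mu, horizon):
    """Exact law of r(S_0) + ... + r(S_{horizon-1}) as {total: probability}."""
    law = {}
    for s, p in mu.items():
        if p > 0:
            law[(s, r[s])] = law.get((s, r[s]), 0.0) + p
    for _ in range(horizon - 1):
        nxt = {}
        for (s, g), p in law.items():
            for t, q in P[s].items():
                if q > 0:
                    key = (t, g + r[t])
                    nxt[key] = nxt.get(key, 0.0) + p * q
        law = nxt
    totals = {}
    for (_, g), p in law.items():
        totals[g] = totals.get(g, 0.0) + p
    return totals


def mean_variance(law, lam):
    m = sum(g * p for g, p in law.items())
    v = sum((g - m) ** 2 * p for g, p in law.items())
    return m - lam * v


def exp_utility(law, beta):
    return math.log(sum(p * math.exp(beta * g) for g, p in law.items())) / beta


# Toy chain: states 1 and 2 behave alike and pay the same reward.
P = {0: {0: 0.5, 1: 0.25, 2: 0.25}, 1: {0: 0.4, 1: 0.6}, 2: {0: 0.4, 2: 0.6}}
r = {0: 0.0, 1: 1.0, 2: 1.0}
mu = {0: 1.0}
partition = [[0], [1, 2]]

P_l = lump(P, partition)
r_l = {i: r[block[0]] for i, block in enumerate(partition)}
mu_l = {i: sum(mu.get(s, 0.0) for s in block) for i, block in enumerate(partition)}

law_full = return_law(P, r, mu, horizon=4)
law_lumped = return_law(P_l, r_l, mu_l, horizon=4)
gap_mv = abs(mean_variance(law_full, 1.0) - mean_variance(law_lumped, 1.0))
gap_eu = abs(exp_utility(law_full, -0.5) - exp_utility(law_lumped, -0.5))
```

Under these assumptions both gaps should vanish up to floating-point error, because the lumped chain reproduces the full reward process in law. A gap that stays visibly away from zero on the paper's own examples would be the kind of evidence described above.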

Figures

Figures reproduced from arXiv: 1907.05231 by Jia Yuan Yu, Shuai Ma.

**Figure 3.** The transformed Markov process with a deterministic …

**Figure 4.** The comparison among the empirical mean-variance …

**Figure 6.** The comparison among the empirical exponential …

Original abstract

Variance plays a crucial role in risk-sensitive reinforcement learning, and most risk measures can be analyzed via variance. In this paper, we consider two law-invariant risks as examples: mean-variance risk and exponential utility risk. With the aid of the state-augmentation transformation (SAT), we show that, the two risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To relieve the enlarged state space, a novel definition of isotopic states is proposed for state lumping, considering the special structure of the transformed transition probability. In the numerical experiment, we illustrate state lumping in the SAT, errors from a naive reward simplification, and the validity of the SAT for the two risk estimations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a state-augmentation trick plus isotopic lumping to estimate two specific nonlinear risks exactly in MDPs with stochastic rewards, but the lumping step needs a clear proof that it preserves the risk values rather than approximates them.

The central claim is that a state-augmentation transformation turns mean-variance and exponential-utility risk estimation into a problem that can be solved on a lumped chain of isotopic states, and that this works exactly even with stochastic transition rewards and randomized policies. That is the one thing a colleague should take away first.

The isotopic-state definition is new in this context and is tailored to the structure of the transformed transitions, which is a reasonable incremental step inside the state-augmentation line of work. The numerical experiment is presented as validation that the lumping works and that naive reward averaging does not, which at least shows the authors checked the practical behavior on an example. Those are the parts that are actually new and that the paper does cleanly.

The soft spot is the preservation argument. Both risk measures are nonlinear functionals of the return law. When states are lumped, their individual reward realizations are replaced by an average; nothing in the abstract or the stress-test note shows that this averaging commutes with the risk functional for every admissible policy. If that step only holds under extra conditions that are not stated, the estimator becomes approximate rather than exact. The experiment description supplies no error numbers, no baseline comparisons, and no derivation steps, so it is impossible to judge how large any discrepancy is.

The paper is aimed at people already working on risk-sensitive RL inside MDPs that have stochastic rewards. A reader who needs a computational device for exactly those two risk measures could extract the transformation and the lumping rule and test them themselves. It is coherent enough on its own terms to deserve a serious referee who can check the commutation claim and the experiment details rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that a state-augmentation transformation (SAT) enables exact estimation of mean-variance and exponential-utility risks in MDPs with stochastic transition-based rewards and randomized policies; a novel isotopic-state definition then permits state lumping that preserves these exact risk values by exploiting the structure of the transformed transition probabilities, with validity illustrated in a numerical experiment.

Significance. If the isotopic lumping is proven to preserve the exact nonlinear risk values without approximation, the combination of SAT and lumping would supply a practical, non-approximate method for computing variance-based risks in MDPs whose state space would otherwise explode, advancing risk-sensitive RL for stochastic rewards and randomized policies.

major comments (2)

[Abstract] Abstract (central claim on exact estimation via lumping): the manuscript asserts that isotopic lumping preserves exact mean-variance (quadratic) and exponential-utility (strictly convex) values, yet provides no proof that reward averaging under lumping commutes with these nonlinear functionals for arbitrary randomized policies and stochastic rewards. The special structure of the SAT transition probabilities is invoked but not shown to guarantee equality of the risk measures post-lumping.
[Numerical experiment] Numerical experiment (abstract): the only empirical support is described as 'illustrat[ing] ... the validity of the SAT,' but the abstract supplies no error metrics, baseline comparisons, quantitative tables, or derivation steps. This leaves the load-bearing validation of both SAT and lumping uninspectable.
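The concern about averaging and nonlinear functionals can be made concrete with a one-step sketch. The reward law below and the helper `collapse_to_mean` are hypothetical and chosen only for illustration; they are not taken from the paper's experiment. Replacing a random reward by its expectation keeps the mean but removes the variance, and by Jensen's inequality it shifts the exponential utility.

```python
import math


def mean(law):
    return sum(v * p for v, p in law.items())


def variance(law):
    m = mean(law)
    return sum((v - m) ** 2 * p for v, p in law.items())


def exp_utility(law, beta):
    # (1 / beta) * log E[exp(beta * R)], one common form of the exponential utility
    return math.log(sum(p * math.exp(beta * v) for v, p in law.items())) / beta


def collapse_to_mean(law):
    """Naive simplification: replace the random reward by its expectation."""
    return {mean(law): 1.0}


# Hypothetical one-step reward: 0 or 2 with equal probability.
reward_law = {0.0: 0.5, 2.0: 0.5}
naive_law = collapse_to_mean(reward_law)

mean_gap = abs(mean(reward_law) - mean(naive_law))          # unchanged mean
variance_gap = variance(reward_law) - variance(naive_law)   # variance is lost
utility_gap_pos = exp_utility(reward_law, 0.5) - exp_utility(naive_law, 0.5)
utility_gap_neg = exp_utility(reward_law, -0.5) - exp_utility(naive_law, -0.5)
```

The sign of the utility gap follows the sign of `beta`, which is why a proof of exactness has to rely on the structure of the lumped states and not on averaging alone.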

minor comments (1)

[Abstract] Abstract: the clause 'we show that, the two risks' contains an extraneous comma after 'that'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address the major comments point by point below.

Referee: [Abstract] Abstract (central claim on exact estimation via lumping): the manuscript asserts that isotopic lumping preserves exact mean-variance (quadratic) and exponential-utility (strictly convex) values, yet provides no proof that reward averaging under lumping commutes with these nonlinear functionals for arbitrary randomized policies and stochastic rewards. The special structure of the SAT transition probabilities is invoked but not shown to guarantee equality of the risk measures post-lumping.

Authors: The commutation property follows from the isotopic state definition and the structure of the SAT transition kernel, as derived in Section 3 of the manuscript. However, we agree that an explicit statement of how averaging commutes with the nonlinear functionals (for both risk measures and randomized policies) would strengthen the presentation. We will add a dedicated remark or short proof sketch in the revision.

Revision: yes

Referee: [Numerical experiment] Numerical experiment (abstract): the only empirical support is described as 'illustrat[ing] ... the validity of the SAT,' but the abstract supplies no error metrics, baseline comparisons, quantitative tables, or derivation steps. This leaves the load-bearing validation of both SAT and lumping uninspectable.

Authors: The full experimental results, including error metrics and comparisons to naive reward averaging, appear in Section 4. We agree the abstract is overly terse on this point and will revise it to include a brief quantitative summary of the observed errors and validity checks.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external SAT plus independent lumping definition

The paper's central construction begins with an external state-augmentation transformation (SAT) and then introduces a novel definition of isotopic states based on the transformed transition probabilities. No equation or definition in the abstract or description reduces the claimed exact preservation of mean-variance or exponential-utility risks to a quantity already fitted or defined inside the same paper; the lumping rule is presented as a new structural property rather than a self-referential fit. No self-citation chain is invoked to justify uniqueness or to smuggle an ansatz, and the numerical experiments are described as validation rather than as the source of the risk values themselves. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or sections from which free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1] S. Kusuoka, “On law invariant coherent risk measures,” in Advances in Mathematical Economics, pp. 83–95, Springer, 2001

[2] V. S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of Operations Research, vol. 27, no. 2, pp. 294–311, 2002

[3] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015

[4] W. Huang and W. B. Haskell, “Risk-aware Q-learning for Markov decision processes,” in Proceedings of the 56th IEEE Conference on Decision and Control (CDC), pp. 4928–4933, 2017

[5] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017

[6] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), pp. 908–918, 2017

[7] S. Ma and J. Y. Yu, “State-augmentation transformations for risk-sensitive reinforcement learning,” arXiv:1804.05950v2, 2018

[8] A. Nilim and L. E. Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005

[9] A. Ruszczyński, “Risk-averse dynamic programming for Markov decision processes,” Mathematical Programming, vol. 125, no. 2, pp. 235–261, 2010

[10] D. J. White, “Mean, variance, and probabilistic criteria in finite Markov decision processes: A review,” Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1–29, 1988

[11] M. J. Sobel, “Mean-variance tradeoffs in an undiscounted MDP,” Operations Research, vol. 42, no. 1, pp. 175–183, 1994

[12] S. Mannor and J. Tsitsiklis, “Mean-variance optimization in Markov decision processes,” in Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1–22, 2011

[13] H.-S. Lau, “The newsboy problem under alternative optimization objectives,” Journal of the Operational Research Society, vol. 31, no. 6, pp. 525–535, 1980

[14] T.-M. Choi, D. Li, and H. Yan, “Mean-variance analysis for the newsvendor problem,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 38, no. 5, pp. 1169–1180, 2008

[15] C.-H. Chiu and T.-M. Choi, “Supply chain risk analysis with mean-variance models: A technical review,” Annals of Operations Research, vol. 240, no. 2, pp. 489–507, 2016

[16] R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision processes,” Management Science, vol. 18, no. 7, pp. 356–369, 1972

[17] K.-J. Chung and M. J. Sobel, “Discounted MDPs: Distribution functions and exponential utility maximization,” SIAM Journal on Control and Optimization, vol. 25, no. 1, pp. 49–62, 1987

[18] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999

[19] B. Ravindran and A. G. Barto, “Model minimization in hierarchical reinforcement learning,” in International Symposium on Abstraction, Reformulation, and Approximation, pp. 196–211, Springer, 2002

[20] J. G. Kemeny and J. L. Snell, Finite Markov Chains. Springer-Verlag, New York, 1976

[21] C. Burke and M. Rosenblatt, “A Markovian function of a Markov chain,” The Annals of Mathematical Statistics, vol. 29, no. 4, pp. 1112–1122, 1958

[22] P. G. Harrison and N. M. Patel, Performance modelling of communication networks and computer architectures (International Computer S. Addison-Wesley Longman Publishing Co., Inc., 1992

[23] M. J. Sobel, “The variance of discounted Markov decision processes,” Journal of Applied Probability, vol. 19, no. 4, pp. 794–802, 1982

[24] L. Xia, “Mean-variance optimization of discrete time discounted Markov decision processes,” Automatica, vol. 88, pp. 76–82, 2018

[25] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive reinforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298–1328, 2014
