Variance-Based Risk Estimations in Markov Processes via Transformation with State Lumping
Pith reviewed 2026-05-25 00:15 UTC · model grok-4.3
The pith
State augmentation and isotopic lumping enable exact estimation of mean-variance and exponential utility risks in MDPs with stochastic rewards and randomized policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With the aid of the state-augmentation transformation (SAT), the two risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To relieve the enlarged state space, a novel definition of isotopic states is proposed for state lumping, considering the special structure of the transformed transition probability.
What carries the argument
State-augmentation transformation (SAT) that converts the MDP into an equivalent process where risks are estimable, combined with isotopic-state lumping that exploits the structure of the transformed transition probabilities to shrink the state space without altering the risk values.
If this is right
- Both mean-variance risk and exponential utility risk become estimable under stochastic transition rewards and randomized policies.
- The state-space growth caused by augmentation is offset by lumping without introducing approximation error into the risk estimates.
- A naive simplification of the reward distribution produces observable errors that the SAT-plus-lumping procedure avoids.
- The procedure is illustrated to be valid on concrete numerical examples for the two risks.
Where Pith is reading between the lines
- The same transformation-plus-lumping pattern may apply to other law-invariant risk measures beyond the two examined.
- If similar isotopic structure can be identified after augmentation, the approach could scale risk estimation to larger MDPs than direct methods allow.
- The technique offers a model-reduction route that keeps exact risk semantics rather than approximating them.
Load-bearing premise
The transformed transition probabilities have a structure that permits isotopic states to be identified and lumped while exactly preserving the original risk values.
What would settle it
Compute the mean-variance or exponential utility risk on the augmented chain before and after applying the proposed isotopic lumping; any nonzero difference in the risk values would show that the lumping step does not preserve exactness.
Figures
read the original abstract
Variance plays a crucial role in risk-sensitive reinforcement learning, and most risk measures can be analyzed via variance. In this paper, we consider two law-invariant risks as examples: mean-variance risk and exponential utility risk. With the aid of the state-augmentation transformation (SAT), we show that, the two risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To relieve the enlarged state space, a novel definition of isotopic states is proposed for state lumping, considering the special structure of the transformed transition probability. In the numerical experiment, we illustrate state lumping in the SAT, errors from a naive reward simplification, and the validity of the SAT for the two risk estimations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a state-augmentation transformation (SAT) enables exact estimation of mean-variance and exponential-utility risks in MDPs with stochastic transition-based rewards and randomized policies; a novel isotopic-state definition then permits state lumping that preserves these exact risk values by exploiting the structure of the transformed transition probabilities, with validity illustrated in a numerical experiment.
Significance. If the isotopic lumping is proven to preserve the exact nonlinear risk values without approximation, the combination of SAT and lumping would supply a practical, non-approximate method for computing variance-based risks in MDPs whose state space would otherwise explode, advancing risk-sensitive RL for stochastic rewards and randomized policies.
major comments (2)
- [Abstract] Abstract (central claim on exact estimation via lumping): the manuscript asserts that isotopic lumping preserves exact mean-variance (quadratic) and exponential-utility (strictly convex) values, yet provides no proof that reward averaging under lumping commutes with these nonlinear functionals for arbitrary randomized policies and stochastic rewards. The special structure of the SAT transition probabilities is invoked but not shown to guarantee equality of the risk measures post-lumping.
- [Numerical experiment] Numerical experiment (abstract): the only empirical support is described as 'illustrat[ing] ... the validity of the SAT,' but the abstract supplies no error metrics, baseline comparisons, quantitative tables, or derivation steps. This leaves the load-bearing validation of both SAT and lumping uninspectable.
minor comments (1)
- [Abstract] Abstract: the clause 'we show that, the two risks' contains an extraneous comma after 'that'.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address the major comments point by point below.
read point-by-point responses
-
Referee: [Abstract] Abstract (central claim on exact estimation via lumping): the manuscript asserts that isotopic lumping preserves exact mean-variance (quadratic) and exponential-utility (strictly convex) values, yet provides no proof that reward averaging under lumping commutes with these nonlinear functionals for arbitrary randomized policies and stochastic rewards. The special structure of the SAT transition probabilities is invoked but not shown to guarantee equality of the risk measures post-lumping.
Authors: The commutation property follows from the isotopic state definition and the structure of the SAT transition kernel, as derived in Section 3 of the manuscript. However, we agree that an explicit statement of how averaging commutes with the nonlinear functionals (for both risk measures and randomized policies) would strengthen the presentation. We will add a dedicated remark or short proof sketch in the revision. revision: yes
-
Referee: [Numerical experiment] Numerical experiment (abstract): the only empirical support is described as 'illustrat[ing] ... the validity of the SAT,' but the abstract supplies no error metrics, baseline comparisons, quantitative tables, or derivation steps. This leaves the load-bearing validation of both SAT and lumping uninspectable.
Authors: The full experimental results, including error metrics and comparisons to naive reward averaging, appear in Section 4. We agree the abstract is overly terse on this point and will revise it to include a brief quantitative summary of the observed errors and validity checks. revision: yes
Circularity Check
No circularity: derivation uses external SAT plus independent lumping definition
full rationale
The paper's central construction begins with an external state-augmentation transformation (SAT) and then introduces a novel definition of isotopic states based on the transformed transition probabilities. No equation or definition in the abstract or description reduces the claimed exact preservation of mean-variance or exponential-utility risks to a quantity already fitted or defined inside the same paper; the lumping rule is presented as a new structural property rather than a self-referential fit. No self-citation chain is invoked to justify uniqueness or to smuggle an ansatz, and the numerical experiments are described as validation rather than as the source of the risk values themselves. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
On law invariant coherent risk measures,
S. Kusuoka, “On law invariant coherent risk measures,” in Advances in Mathematical Economics , pp. 83–95, Springer, 2001
work page 2001
-
[2]
Q-learning for risk-sensitive control,
V . S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of Operations Research, vol. 27, no. 2, pp. 294–311, 2002
work page 2002
-
[3]
A comprehensive survey on safe rein- forcement learning,
J. Garc ´ıa and F. Fern ´andez, “A comprehensive survey on safe rein- forcement learning,” Journal of Machine Learning Research , vol. 16, no. 1, pp. 1437–1480, 2015
work page 2015
-
[4]
Risk-aware Q-learning for Markov decision processes,
W. Huang and W. B. Haskell, “Risk-aware Q-learning for Markov decision processes,” in Proceedings of the 56th IEEE Conference on Decision and Control (CDC) , pp. 4928–4933, 2017
work page 2017
-
[5]
Risk- constrained reinforcement learning with percentile risk criteria,
Y . Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk- constrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017
work page 2017
-
[6]
Safe model-based reinforcement learning with stability guarantees,
F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), pp. 908–918, 2017
work page 2017
-
[7]
State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning
S. Ma and J. Y . Yu, “State-augmentation transformations for risk- sensitive reinforcement learning,” arXiv:1804.05950v2:, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[8]
Robust control of Markov decision processes with uncertain transition matrices,
A. Nilim and L. E. Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research , vol. 53, no. 5, pp. 780–798, 2005
work page 2005
-
[9]
Risk-averse dynamic programming for Markov deci- sion processes,
A. Ruszczy ´nski, “Risk-averse dynamic programming for Markov deci- sion processes,” Mathematical Programming, vol. 125, no. 2, pp. 235– 261, 2010
work page 2010
-
[10]
Mean , variance , and probabilistic criteria in finite Markov decision processes : A review,
D. J. White, “Mean , variance , and probabilistic criteria in finite Markov decision processes : A review,” Journal of Optimization Theory and Applications , vol. 56, no. 1, pp. 1–29, 1988
work page 1988
-
[11]
Mean-variance tradeoffs in an undiscounted MDP,
M. J. Sobel, “Mean-variance tradeoffs in an undiscounted MDP,” Operations Research, vol. 42, no. 1, pp. 175–183, 1994
work page 1994
-
[12]
Mean-variance optimization in Markov decision processes,
S. Mannor and J. Tsitsiklis, “Mean-variance optimization in Markov decision processes,” in Proceedings of the 28th International Confer- ence on Machine Learning (ICML) , pp. 1–22, 2011
work page 2011
-
[13]
The newsboy problem under alternative optimization objectives,
H.-S. Lau, “The newsboy problem under alternative optimization objectives,” Journal of the Operational Research Society , vol. 31, no. 6, pp. 525–535, 1980
work page 1980
-
[14]
Mean-variance analysis for the newsvendor problem,
T.-M. Choi, D. Li, and H. Yan, “Mean-variance analysis for the newsvendor problem,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans , vol. 38, no. 5, pp. 1169– 1180, 2008
work page 2008
-
[15]
Supply chain risk analysis with mean- variance models: A technical review,
C.-H. Chiu and T.-M. Choi, “Supply chain risk analysis with mean- variance models: A technical review,” Annals of Operations Research, vol. 240, no. 2, pp. 489–507, 2016
work page 2016
-
[16]
Risk-sensitive Markov decision processes,
R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision processes,” Management science, vol. 18, no. 7, pp. 356–369, 1972
work page 1972
-
[17]
Discounted MDPs: Distribution func- tions and exponential utility maximization,
K.-J. Chung and M. J. Sobel, “Discounted MDPs: Distribution func- tions and exponential utility maximization,” SIAM journal on control and optimization, vol. 25, no. 1, pp. 49–62, 1987
work page 1987
-
[18]
Altman, Constrained Markov Decision Processes
E. Altman, Constrained Markov Decision Processes. CRC Press, 1999
work page 1999
-
[19]
Model minimization in hierarchical reinforcement learning,
B. Ravindran and A. G. Barto, “Model minimization in hierarchical reinforcement learning,” in International Symposium on Abstraction, Reformulation, and Approximation , pp. 196–211, Springer, 2002
work page 2002
-
[20]
J. G. Kemeny and J. L. Snell, Finite Markov Chains. Springer-Verlag, New York, 1976
work page 1976
-
[21]
A markovian function of a markov chain,
C. Burke and M. Rosenblatt, “A markovian function of a markov chain,” The Annals of Mathematical Statistics, vol. 29, no. 4, pp. 1112– 1122, 1958
work page 1958
-
[22]
P. G. Harrison and N. M. Patel, Performance modelling of communi- cation networks and computer architectures (International Computer S. Addison-Wesley Longman Publishing Co., Inc., 1992
work page 1992
-
[23]
The variance of discounted Markov decision processes,
M. J. Sobel, “The variance of discounted Markov decision processes,” Journal of Applied Probability , vol. 19, no. 4, pp. 794–802, 1982
work page 1982
-
[24]
Mean-variance optimization of discrete time discounted markov decision processes,
L. Xia, “Mean-variance optimization of discrete time discounted markov decision processes,” Automatica, vol. 88, pp. 76–82, 2018
work page 2018
-
[25]
Risk-sensitive reinforcement learning,
Y . Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive reinforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298– 1328, 2014
work page 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.