arxiv: 1907.04269 · v1 · pith:DZHZFPNZnew · submitted 2019-07-09 · 💻 cs.AI

A Scheme for Dynamic Risk-Sensitive Sequential Decision Making

Pith reviewed 2026-05-25 00:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords risk-sensitive sequential decision making · Markov decision processes · neural network approximation · mean-variance risk measures · state augmentation · dynamic parameters · stochastic rewards

The pith

A neural network can approximate risk-sensitive policies for dynamic Markov decision processes by estimating risks from return variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a scheme to train a neural network that maps problem parameters to risk values and constrained policies for sequential decisions. It generates synthetic training data by sampling parameters over intervals to handle time-varying conditions, focusing on cases where objectives and constraints depend on the mean and variance of returns. This matters for a sympathetic reader because it offers a way to manage uncertainty in changing environments without solving each new instance from scratch. The approach rests on showing that variance can stand in for most risk measures and that state augmentation makes stochastic-reward problems tractable.

Core claim

For risk-sensitive problems in which the objective and constraints are or can be estimated by functions of the mean and variance of return, a neural network is trained as an approximator of the mapping from parameter space to the space of risk and policy with risk-sensitive constraints; most risk measures can be estimated using return variance; by virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and the proposed scheme is validated by a numerical experiment.

What carries the argument

Neural network approximator of the mapping from parameter space to risk-and-policy space, paired with the state-augmentation transformation for MDPs.
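
As an illustration only, here is a minimal Python sketch of one common form of state augmentation, written for a fixed policy (a Markov reward process) rather than a full MDP. It is not taken from the paper, and the paper's own transformation may differ in detail; the function name `augment` and the toy numbers are assumptions. A reward that depends on the transition (x, y), and so looks random from state x alone, becomes a deterministic function of the augmented state (x, y).

```python
from itertools import product


def augment(P, r):
    """Convert transition-based rewards r[x][y] into state-based rewards.

    P[x][y] is the transition probability from x to y under a fixed policy.
    Augmented states are the transitions (x, y) that can actually occur.
    """
    n = len(P)
    states = [(x, y) for x, y in product(range(n), repeat=2) if P[x][y] > 0]
    index = {s: i for i, s in enumerate(states)}
    m = len(states)
    P_aug = [[0.0] * m for _ in range(m)]
    r_aug = [0.0] * m
    for (x, y), i in index.items():
        # The reward is now a deterministic function of the augmented state.
        r_aug[i] = r[x][y]
        # From (x, y) the chain moves to (y, z) with the original probability.
        for z in range(n):
            if P[y][z] > 0:
                P_aug[i][index[(y, z)]] = P[y][z]
    return states, P_aug, r_aug


# Hypothetical two-state example.
P = [[0.5, 0.5], [0.2, 0.8]]
r = [[1.0, 0.0], [2.0, -1.0]]
states, P_aug, r_aug = augment(P, r)
```

If the augmented chain starts in (x0, y) with probability P[x0][y], it emits the same reward sequence distribution as the original process started in x0, so the mean and variance of return carry over.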

If this is right

  • Most risk measures can be estimated using return variance.
  • Markov decision processes with stochastic rewards become solvable in a risk-sensitive scenario through state augmentation.
  • Dynamic parameters are handled by sampling them within specified intervals to create synthetic training data.
  • The overall scheme produces usable policies and risk estimates as shown in a numerical experiment.
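
The data-generation step can be sketched in Python under strong simplifying assumptions. This toy is not the paper's experiment: each "action" is a hypothetical option paying `payoff` with probability `p` in each of `horizon` independent periods, the risk-sensitive rule is a plain mean-minus-weighted-variance score, and every name and interval below is made up for illustration. Network training is left out; the sketch only shows how sampling parameters within intervals yields (parameters, policy and risk) pairs.

```python
import random


def mean_var(p, payoff, horizon):
    """Mean and variance of the sum of `horizon` i.i.d. payoffs,
    each equal to `payoff` with probability p and 0 otherwise."""
    mean = horizon * p * payoff
    var = horizon * p * (1.0 - p) * payoff ** 2
    return mean, var


def solve_instance(params, risk_weight=0.5, horizon=10):
    """Return the action maximizing mean - risk_weight * variance,
    together with the return variance of that action."""
    best = None
    for action, (p, payoff) in enumerate(params):
        mean, var = mean_var(p, payoff, horizon)
        score = mean - risk_weight * var
        if best is None or score > best[0]:
            best = (score, action, var)
    _, action, var = best
    return action, var


def make_dataset(intervals, n_samples, seed=0):
    """Sample each parameter uniformly within its interval and label the
    sample with the solved (action, variance) pair."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        params = [(rng.uniform(*p_iv), rng.uniform(*pay_iv))
                  for p_iv, pay_iv in intervals]
        features = [value for pair in params for value in pair]
        data.append((features, solve_instance(params)))
    return data


# Hypothetical intervals: (probability range, payoff range) per action.
intervals = [((0.6, 0.9), (1.0, 1.5)), ((0.2, 0.5), (2.0, 4.0))]
dataset = make_dataset(intervals, n_samples=1000)
```

Each row pairs a sampled parameter vector with its label, which is the shape of data a regression or classification network would be fitted to.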

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approximation approach could reduce repeated optimization costs in applications where parameters drift slowly, such as resource allocation under changing demand.
  • If the trained network generalizes across unseen parameter values, it might support online policy updates without full retraining.
  • Similar neural approximations might extend to other risk proxies if the mean-variance assumption is relaxed in future work.

Load-bearing premise

The objective and constraints are, or can be estimated by, functions of the mean and variance of return.
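
To make the premise concrete, the sketch below computes the mean and variance of the discounted return for a Markov reward process with deterministic state-based rewards, using the standard first- and second-moment recursions. It is a generic illustration, not the paper's code; the function name, the fixed iteration count, and the two-state example are assumptions.

```python
def return_mean_variance(P, r, gamma, iters=2000):
    """Mean and variance of the discounted return from each start state.

    Writes R_s = r[s] + gamma * R_{s'} and iterates
      v[s] = r[s] + gamma * sum_t P[s][t] * v[t]                  (mean)
      m[s] = r[s]**2 + 2*gamma*r[s] * sum_t P[s][t] * v[t]
             + gamma**2 * sum_t P[s][t] * m[t]                    (second moment)
    Assumes 0 <= gamma < 1 so that both recursions converge.
    """
    n = len(r)
    v = [0.0] * n
    m = [0.0] * n
    for _ in range(iters):
        Pv = [sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
        Pm = [sum(P[s][t] * m[t] for t in range(n)) for s in range(n)]
        v_new = [r[s] + gamma * Pv[s] for s in range(n)]
        m_new = [r[s] ** 2 + 2 * gamma * r[s] * Pv[s] + gamma ** 2 * Pm[s]
                 for s in range(n)]
        v, m = v_new, m_new
    variance = [m[s] - v[s] ** 2 for s in range(n)]
    return v, variance


# Hypothetical example: state 0 pays 1 and stays with probability 0.5,
# otherwise it moves to the absorbing, zero-reward state 1.
P = [[0.5, 0.5], [0.0, 1.0]]
r = [1.0, 0.0]
mean, variance = return_mean_variance(P, r, gamma=0.5)
# From state 0 the exact values are mean = 4/3 and variance = 8/63.
```

An objective such as mean minus a multiple of variance, or a constraint capping variance, can then be evaluated directly from these two vectors.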

What would settle it

A counterexample in which a standard risk measure used in sequential decisions cannot be estimated accurately from return variance alone would falsify the central reduction claim.
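
A small numerical check shows what such a counterexample could look like. The two return distributions below are made up for illustration; they share the same mean and variance but differ in lower-tail VaR and CVaR. The convention used here is an assumption (VaR as the lower α-quantile of return, CVaR as the mean of the worst α fraction); other sign conventions exist.

```python
def moments(dist):
    """Mean and variance of a discrete distribution given as (value, prob) pairs."""
    mean = sum(p * x for x, p in dist)
    var = sum(p * (x - mean) ** 2 for x, p in dist)
    return mean, var


def var_cvar(dist, alpha):
    """Lower-tail VaR (alpha-quantile) and CVaR (mean of the worst alpha mass)."""
    remaining = alpha
    tail_sum = 0.0
    quantile = None
    for x, p in sorted(dist):          # walk outcomes from worst to best
        take = min(p, remaining)       # probability mass counted in the tail
        tail_sum += take * x
        remaining -= take
        quantile = x
        if remaining <= 1e-12:
            break
    return quantile, tail_sum / alpha


# Two hypothetical return distributions with mean 0 and variance 1.
dist_a = [(-1.0, 0.5), (1.0, 0.5)]
dist_b = [(-2.0, 0.125), (0.0, 0.75), (2.0, 0.125)]

moments_a, moments_b = moments(dist_a), moments(dist_b)
tail_a, tail_b = var_cvar(dist_a, alpha=0.1)
tail_a = var_cvar(dist_a, alpha=0.1)
tail_b = var_cvar(dist_b, alpha=0.1)
```

Here `moments_a` and `moments_b` coincide, while `tail_a` and `tail_b` do not, so no function of mean and variance alone can return both tail values. Whether an approximate estimate is still acceptable in a given application is a separate question that this check does not settle.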

Figures

Figures reproduced from arXiv: 1907.04269 by Ahmet Satir, Jia Yuan Yu, Shuai Ma.

Figure 1: A dynamic risk evaluation scheme with NN and RL methods for ... (caption truncated; figure not reproduced here)
Figure 2: The product (solid) and order (dashed) flows between the retailer ... (caption truncated; figure not reproduced here)
Figure 3: The loss for training/validating a 3-layer network in 50 epochs. (figure not reproduced here)
Original abstract

We present a scheme for sequential decision making with a risk-sensitive objective and constraints in a dynamic environment. A neural network is trained as an approximator of the mapping from parameter space to space of risk and policy with risk-sensitive constraints. For a given risk-sensitive problem, in which the objective and constraints are, or can be estimated by, functions of the mean and variance of return, we generate a synthetic dataset as training data. Parameters defining a targeted process might be dynamic, i.e., they might vary over time, so we sample them within specified intervals to deal with these dynamics. We show that: i). Most risk measures can be estimated using return variance; ii). By virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and iii). The proposed scheme is validated by a numerical experiment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a neural-network approximator that maps parameters of a dynamic process to risk-sensitive policies and risk values for sequential decision problems. It restricts attention to settings where objectives and constraints depend on (or can be estimated from) the mean and variance of returns, generates synthetic training data by sampling parameters over intervals, and invokes a state-augmentation transformation to convert MDPs with stochastic rewards into risk-sensitive form. The manuscript asserts three results: (i) most risk measures can be estimated from return variance, (ii) the state-augmentation step enables practical risk-sensitive solutions, and (iii) the scheme is validated by a numerical experiment.

Significance. If the mean-variance reduction were rigorously justified and the numerical results were shown to be reproducible, the work would supply a practical, data-driven method for handling time-varying risk-sensitive MDPs. The state-augmentation idea and the use of a single neural approximator for dynamic parameters are potentially useful engineering contributions, but only within the narrow class of problems already known to be mean-variance approximable.

major comments (3)
  1. [Abstract] Abstract: the claim that 'Most risk measures can be estimated using return variance' is stated without derivation, theorem, or citation. Standard tail-based measures (VaR, CVaR, spectral risk measures) depend on quantiles or the full distribution and are not functions of the first two moments; the manuscript therefore inherits an unverified scope limitation that affects both the state-augmentation step and the neural approximator.
  2. [Abstract] Abstract (claims i–iii): the three numbered assertions are presented as results shown by the paper, yet the provided text supplies neither proofs nor quantitative experimental outcomes (e.g., no reported policy values, risk estimates, or comparison metrics). This absence makes it impossible to assess whether the numerical experiment actually supports the claims.
  3. [Abstract] Abstract: the problem statement restricts attention to objectives and constraints that 'are, or can be estimated by, functions of the mean and variance of return,' yet the broader claim (i) is not correspondingly scoped. The mismatch between the restricted setting and the general assertion is load-bearing for the paper's stated contribution.
minor comments (1)
  1. [Abstract] Abstract phrasing is awkward ('Parameters defining a targeted process might be dynamic, i.e., they might vary over time, so we sample them within specified intervals to deal with these dynamics').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and will make corresponding revisions to the abstract for accuracy and consistency with the manuscript's scope.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'Most risk measures can be estimated using return variance' is stated without derivation, theorem, or citation. Standard tail-based measures (VaR, CVaR, spectral risk measures) depend on quantiles or the full distribution and are not functions of the first two moments; the manuscript therefore inherits an unverified scope limitation that affects both the state-augmentation step and the neural approximator.

    Authors: We agree that the claim is overly broad, lacks derivation or citation, and does not hold for tail-based measures such as VaR or CVaR. The manuscript's method is restricted to risk measures estimable from mean and variance; we will remove or qualify this statement in the revised abstract to eliminate the overgeneralization. revision: yes

  2. Referee: [Abstract] Abstract (claims i–iii): the three numbered assertions are presented as results shown by the paper, yet the provided text supplies neither proofs nor quantitative experimental outcomes (e.g., no reported policy values, risk estimates, or comparison metrics). This absence makes it impossible to assess whether the numerical experiment actually supports the claims.

    Authors: The abstract is a high-level summary; the full manuscript describes the state-augmentation transformation and the numerical experiment. To strengthen substantiation, we will revise the abstract to include key quantitative outcomes or metrics from the experiment. revision: partial

  3. Referee: [Abstract] Abstract: the problem statement restricts attention to objectives and constraints that 'are, or can be estimated by, functions of the mean and variance of return,' yet the broader claim (i) is not correspondingly scoped. The mismatch between the restricted setting and the general assertion is load-bearing for the paper's stated contribution.

    Authors: We acknowledge the inconsistency in scope. Claim (i) will be revised to align explicitly with the mean-variance restriction used throughout the problem statement and method. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper explicitly limits its scope to problems where objectives and constraints are (or can be estimated by) functions of mean and variance of return. The statement 'Most risk measures can be estimated using return variance' is asserted without a derivation, equation, or self-citation that reduces it to the paper's own inputs by construction. The state-augmentation transformation and neural approximator are presented as methods applicable within this scoped class, with validation via synthetic data and experiment. No load-bearing step matches the enumerated circularity patterns; the derivation remains self-contained against external benchmarks for the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits identification of additional parameters or entities; main assumption is the mean-variance estimation of risk.

axioms (1)
  • domain assumption: The objective and constraints are, or can be estimated by, functions of the mean and variance of return.
    Explicitly stated in the abstract as the basis for the approach and dataset generation.


