A Scheme for Dynamic Risk-Sensitive Sequential Decision Making

Ahmet Satir; Jia Yuan Yu; Shuai Ma

arxiv: 1907.04269 · v1 · pith:DZHZFPNZnew · submitted 2019-07-09 · 💻 cs.AI

Pith reviewed 2026-05-25 00:23 UTC · model grok-4.3

keywords risk-sensitive sequential decision making · Markov decision processes · neural network approximation · mean-variance risk measures · state augmentation · dynamic parameters · stochastic rewards

The pith

A neural network can approximate risk-sensitive policies for dynamic Markov decision processes by estimating risks from return variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a scheme for training a neural network that maps problem parameters to risk values and constrained policies for sequential decisions. It generates synthetic training data by sampling parameters over intervals to handle time-varying conditions, focusing on cases where objectives and constraints depend on the mean and variance of returns. This matters because it offers a way to manage uncertainty in changing environments without solving each new instance from scratch. The approach rests on the claims that variance can stand in for most risk measures and that state augmentation makes stochastic-reward problems tractable.
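The data-generation step described above can be sketched as follows. This is a minimal illustration and not the paper's experiment: the single-period ordering problem, the demand distribution `DEMAND`, the variance cap, and the function names `profit_stats`, `solve`, and `make_dataset` are all hypothetical stand-ins. Each sample draws parameters uniformly from given intervals, solves the mean-variance constrained instance by enumeration, and stores the pair (parameters, (policy, mean, variance)) that a network could later be trained on.

```python
import random

# Hypothetical demand distribution: (demand, probability) pairs.
DEMAND = [(0, 0.2), (1, 0.5), (2, 0.3)]

def profit_stats(price, cost, q):
    """Mean and variance of profit when ordering q units."""
    outcomes = [(price * min(q, d) - cost * q, p) for d, p in DEMAND]
    mean = sum(p * x for x, p in outcomes)
    var = sum(p * (x - mean) ** 2 for x, p in outcomes)
    return mean, var

def solve(price, cost, var_cap):
    """Order quantity with the highest mean profit among those with variance <= var_cap."""
    feasible = []
    for q in range(3):
        mean, var = profit_stats(price, cost, q)
        if var <= var_cap:
            feasible.append((mean, q, var))
    mean, q, var = max(feasible)  # q = 0 has zero variance, so the list is never empty
    return q, mean, var

def make_dataset(n, price_iv, cost_iv, var_cap, seed=0):
    """Sample parameters uniformly within intervals and label each sample by solving it."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        price = rng.uniform(*price_iv)
        cost = rng.uniform(*cost_iv)
        data.append(((price, cost), solve(price, cost, var_cap)))
    return data

dataset = make_dataset(50, price_iv=(4.0, 6.0), cost_iv=(1.0, 2.0), var_cap=5.0)
```

Enumeration is only viable here because the toy problem has three candidate policies; a sequential problem would need a proper solver to produce the labels.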

Core claim

For risk-sensitive problems in which the objective and constraints are or can be estimated by functions of the mean and variance of return, a neural network is trained as an approximator of the mapping from parameter space to the space of risk and policy with risk-sensitive constraints; most risk measures can be estimated using return variance; by virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and the proposed scheme is validated by a numerical experiment.

What carries the argument

Neural network approximator of the mapping from parameter space to risk-and-policy space, paired with the state-augmentation transformation for MDPs.

If this is right

Most risk measures can be estimated using return variance.
Markov decision processes with stochastic rewards become solvable in a risk-sensitive scenario through state augmentation.
Dynamic parameters are handled by sampling them within specified intervals to create synthetic training data.
The overall scheme produces usable policies and risk estimates as shown in a numerical experiment.
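A minimal sketch of the state-augmentation idea, under assumptions that are not taken from the paper: the policy is fixed (so the process is a Markov reward process), rewards have finite support, and each augmented state is a pair (next state, realized reward). The names `augment`, `P`, and `R` are hypothetical, and the construction in the paper and in its state-augmentation reference may differ in detail. The point is that a stochastic, transition-based reward becomes a deterministic function of the augmented state.

```python
from fractions import Fraction as F

def augment(P, R):
    """Build an augmented chain whose reward is a deterministic function of the state.

    P[s][s2]   : transition probability s -> s2 under a fixed policy.
    R[(s, s2)] : list of (reward, probability) pairs for the transition s -> s2.
    Augmented states are pairs (s2, reward): the state just entered and the reward just received.
    """
    aug_states = sorted({(s2, r) for (_, s2), dist in R.items() for r, _ in dist})
    P_aug = {}
    for (s, r_prev) in aug_states:
        row = {}
        for s2, p in P[s].items():
            for r, q in R[(s, s2)]:
                # Probability of moving to s2 and receiving reward r.
                row[(s2, r)] = row.get((s2, r), F(0)) + p * q
        P_aug[(s, r_prev)] = row
    # State-based, deterministic reward on the augmented chain.
    r_aug = {x: x[1] for x in aug_states}
    return aug_states, P_aug, r_aug

# Hypothetical two-state example.
P = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {0: F(1)}}
R = {
    (0, 0): [(F(1), F(1))],
    (0, 1): [(F(0), F(1, 2)), (F(2), F(1, 2))],
    (1, 0): [(F(3), F(1))],
}
aug_states, P_aug, r_aug = augment(P, R)
```

Because the augmented chain reproduces the joint law of states and rewards, methods that assume deterministic state-based rewards can be applied to it; the cost is a larger state space.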

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approximation approach could reduce repeated optimization costs in applications where parameters drift slowly, such as resource allocation under changing demand.
If the trained network generalizes across unseen parameter values, it might support online policy updates without full retraining.
Similar neural approximations might extend to other risk proxies if the mean-variance assumption is relaxed in future work.

Load-bearing premise

The objective and constraints are, or can be estimated by, functions of the mean and variance of return.
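One standard sense in which a tail quantity can be "estimated by" mean and variance is a distribution-free bound. The following is Cantelli's one-sided inequality, included here as an illustration; the paper is not claimed to use it, and the symbols $G$, $\mu$, $\sigma^2$, $\alpha$, and $q_\alpha$ are notation introduced for this sketch.

```latex
% Return G with mean \mu and variance \sigma^2 (Cantelli's inequality):
\Pr\left(G \le \mu - t\right) \le \frac{\sigma^2}{\sigma^2 + t^2}, \qquad t > 0.

% Setting the right-hand side equal to \alpha \in (0,1) gives
% t_\alpha = \sigma \sqrt{(1-\alpha)/\alpha}, hence a lower bound on the
% lower \alpha-quantile q_\alpha(G):
q_\alpha(G) \ge \mu - \sigma \sqrt{\frac{1-\alpha}{\alpha}}.
```

This yields a bound on the quantile, not its value: two returns with the same mean and variance can still have different quantiles.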

What would settle it

A counterexample in which a standard risk measure used in sequential decisions cannot be estimated accurately from return variance alone would falsify the central reduction claim.
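A small worked example of the kind of counterexample described above: two discrete return distributions with identical mean and variance but different lower-tail CVaR. The distributions `A` and `B`, the level `alpha`, and the helper names are chosen for illustration only and do not come from the paper. Exact rational arithmetic is used so the comparison is not affected by rounding.

```python
from fractions import Fraction as F

def mean(dist):
    return sum(p * x for x, p in dist)

def variance(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist)

def lower_cvar(dist, alpha):
    """Expected return over the worst alpha fraction of outcomes."""
    remaining = alpha
    total = F(0)
    for x, p in sorted(dist):          # worst outcomes first
        take = min(p, remaining)       # probability mass taken from this outcome
        total += take * x
        remaining -= take
        if remaining == 0:
            break
    return total / alpha

# Each distribution is a list of (return, probability) pairs.
A = [(F(-1), F(1, 2)), (F(1), F(1, 2))]
B = [(F(-2), F(1, 8)), (F(0), F(3, 4)), (F(2), F(1, 8))]
alpha = F(1, 10)

print(mean(A), variance(A), lower_cvar(A, alpha))   # 0 1 -1
print(mean(B), variance(B), lower_cvar(B, alpha))   # 0 1 -2
```

Any function of mean and variance alone assigns `A` and `B` the same value, so it cannot recover both CVaR values exactly.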

Figures

Figures reproduced from arXiv: 1907.04269 by Ahmet Satir, Jia Yuan Yu, Shuai Ma.

**Figure 2.** The product (solid) and order (dashed) flows between the retailer

**Figure 3.** The loss for training/validating a 3-layer network in 50 epochs.

Original abstract

We present a scheme for sequential decision making with a risk-sensitive objective and constraints in a dynamic environment. A neural network is trained as an approximator of the mapping from parameter space to space of risk and policy with risk-sensitive constraints. For a given risk-sensitive problem, in which the objective and constraints are, or can be estimated by, functions of the mean and variance of return, we generate a synthetic dataset as training data. Parameters defining a targeted process might be dynamic, i.e., they might vary over time, so we sample them within specified intervals to deal with these dynamics. We show that: i). Most risk measures can be estimated using return variance; ii). By virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and iii). The proposed scheme is validated by a numerical experiment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's assertion that most risk measures can be estimated from return variance is unsupported and does not hold for tail-based measures like VaR or CVaR.

The core issue is that the paper claims most risk measures reduce to functions of return variance, yet provides no derivation, theorem, or reference for this. That reduction is false for quantile-based or spectral risk measures, which depend on the full tail rather than the first two moments. The work is explicitly scoped to problems where objectives and constraints are mean-variance functions, but then states a broader claim without grounding, creating an internal mismatch that the numerical experiment cannot fix. The actual contribution is a neural network trained to approximate the mapping from sampled parameters to risk-sensitive policies, combined with state augmentation to convert variance-aware MDPs into standard ones. Sampling parameters over intervals to handle time variation is a reasonable practical step for non-stationary settings. If the risk really is just mean and variance, the state-augmentation step could be useful for turning constrained problems into solvable MDPs. The soft spot is the unsupported premise in claim i; without external validation or a clear restriction to mean-variance cases only, the scheme inherits an unverified scope. The abstract states the three claims without showing results or proofs, and the reader's note correctly flags that the experiment does not compensate for a failed premise. This is for researchers already focused on mean-variance risk approximations in RL or operations research who need a quick neural mapping for dynamic parameters. Readers seeking a general risk-sensitive framework or formal guarantees will find the foundation too narrow and unproven. I would not send this to peer review until the risk-measure claim is either dropped or properly justified with evidence that stands on its own.

Referee Report

3 major / 1 minor

Summary. The paper proposes a neural-network approximator that maps parameters of a dynamic process to risk-sensitive policies and risk values for sequential decision problems. It restricts attention to settings where objectives and constraints depend on (or can be estimated from) the mean and variance of returns, generates synthetic training data by sampling parameters over intervals, and invokes a state-augmentation transformation to convert MDPs with stochastic rewards into risk-sensitive form. The manuscript asserts three results: (i) most risk measures can be estimated from return variance, (ii) the state-augmentation step enables practical risk-sensitive solutions, and (iii) the scheme is validated by a numerical experiment.

Significance. If the mean-variance reduction were rigorously justified and the numerical results were shown to be reproducible, the work would supply a practical, data-driven method for handling time-varying risk-sensitive MDPs. The state-augmentation idea and the use of a single neural approximator for dynamic parameters are potentially useful engineering contributions, but only within the narrow class of problems already known to be mean-variance approximable.

major comments (3)

[Abstract] Abstract: the claim that 'Most risk measures can be estimated using return variance' is stated without derivation, theorem, or citation. Standard tail-based measures (VaR, CVaR, spectral risk measures) depend on quantiles or the full distribution and are not functions of the first two moments; the manuscript therefore inherits an unverified scope limitation that affects both the state-augmentation step and the neural approximator.
[Abstract] Abstract (claims i–iii): the three numbered assertions are presented as results shown by the paper, yet the provided text supplies neither proofs nor quantitative experimental outcomes (e.g., no reported policy values, risk estimates, or comparison metrics). This absence makes it impossible to assess whether the numerical experiment actually supports the claims.
[Abstract] Abstract: the problem statement restricts attention to objectives 'or can be estimated by, functions of the mean and variance of return,' yet the broader claim (i) is not correspondingly scoped. The mismatch between the restricted setting and the general assertion is load-bearing for the paper's stated contribution.

minor comments (1)

[Abstract] Abstract phrasing is awkward ('Parameters defining a targeted process might be dynamic, i.e., they might vary over time, so we sample them within specified intervals to deal with these dynamics').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and will make corresponding revisions to the abstract for accuracy and consistency with the manuscript's scope.

Referee: [Abstract] Abstract: the claim that 'Most risk measures can be estimated using return variance' is stated without derivation, theorem, or citation. Standard tail-based measures (VaR, CVaR, spectral risk measures) depend on quantiles or the full distribution and are not functions of the first two moments; the manuscript therefore inherits an unverified scope limitation that affects both the state-augmentation step and the neural approximator.

Authors: We agree that the claim is overly broad, lacks derivation or citation, and does not hold for tail-based measures such as VaR or CVaR. The manuscript's method is restricted to risk measures estimable from mean and variance; we will remove or qualify this statement in the revised abstract to eliminate the overgeneralization. revision: yes
Referee: [Abstract] Abstract (claims i–iii): the three numbered assertions are presented as results shown by the paper, yet the provided text supplies neither proofs nor quantitative experimental outcomes (e.g., no reported policy values, risk estimates, or comparison metrics). This absence makes it impossible to assess whether the numerical experiment actually supports the claims.

Authors: The abstract is a high-level summary; the full manuscript describes the state-augmentation transformation and the numerical experiment. To strengthen substantiation, we will revise the abstract to include key quantitative outcomes or metrics from the experiment. revision: partial
Referee: [Abstract] Abstract: the problem statement restricts attention to objectives 'or can be estimated by, functions of the mean and variance of return,' yet the broader claim (i) is not correspondingly scoped. The mismatch between the restricted setting and the general assertion is load-bearing for the paper's stated contribution.

Authors: We acknowledge the inconsistency in scope. Claim (i) will be revised to align explicitly with the mean-variance restriction used throughout the problem statement and method. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper explicitly limits its scope to problems where objectives and constraints are (or can be estimated by) functions of mean and variance of return. The statement 'Most risk measures can be estimated using return variance' is asserted without a derivation, equation, or self-citation that reduces it to the paper's own inputs by construction. The state-augmentation transformation and neural approximator are presented as methods applicable within this scoped class, with validation via synthetic data and experiment. No load-bearing step matches the enumerated circularity patterns; the derivation remains self-contained against external benchmarks for the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits identification of additional parameters or entities; main assumption is the mean-variance estimation of risk.

axioms (1)

domain assumption: The objective and constraints are, or can be estimated by, functions of the mean and variance of return.
Explicitly stated in the abstract as the basis for the approach and dataset generation.

pith-pipeline@v0.9.0 · 5675 in / 1244 out tokens · 30934 ms · 2026-05-25T00:23:39.950614+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

[1]

Sustainable supply chain - supporting tools,

K. Grzybowska and G. Kovács, “Sustainable supply chain - supporting tools,” in 2014 Federated Conference on Computer Science and Information Systems, pp. 1321–1329, 2014

work page 2014
[2]

Franceschetti, Sustainable city logistics: fleet planning, routing and scheduling problems

A. Franceschetti, Sustainable city logistics: fleet planning, routing and scheduling problems. PhD thesis, Technische Universiteit Eindhoven, 2015

work page 2015
[3]

Altman, Constrained Markov Decision Processes

E. Altman, Constrained Markov Decision Processes. CRC Press, 1999

work page 1999
[4]

Robust control of Markov decision processes with uncertain transition matrices,

A. Nilim and L. E. Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005.

work page 2005
[5]

Risk-averse dynamic programming for Markov decision processes,

A. Ruszczyński, “Risk-averse dynamic programming for Markov decision processes,” Mathematical Programming, vol. 125, no. 2, pp. 235–261, 2010

work page 2010
[6]

On law invariant coherent risk measures,

S. Kusuoka, “On law invariant coherent risk measures,” in Advances in Mathematical Economics, pp. 83–95, Springer, 2001

work page 2001
[7]

Risk-sensitive Markov decision processes,

R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision processes,” Management Science, vol. 18, no. 7, pp. 356–369, 1972

work page 1972
[8]

Discounted MDPs: Distribution functions and exponential utility maximization,

K.-J. Chung and M. J. Sobel, “Discounted MDPs: Distribution functions and exponential utility maximization,” SIAM Journal on Control and Optimization, vol. 25, no. 1, pp. 49–62, 1987

work page 1987
[9]

Mean, variance, and probabilistic criteria in finite Markov decision processes: A review,

D. J. White, “Mean, variance, and probabilistic criteria in finite Markov decision processes: A review,” Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1–29, 1988

work page 1988
[10]

Mean-variance tradeoffs in an undiscounted MDP,

M. J. Sobel, “Mean-variance tradeoffs in an undiscounted MDP,” Operations Research, vol. 42, no. 1, pp. 175–183, 1994

work page 1994
[11]

Mean-variance optimization in Markov decision processes,

S. Mannor and J. Tsitsiklis, “Mean-variance optimization in Markov decision processes,” in Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1–22, 2011

work page 2011
[12]

The newsboy problem under alternative optimization objectives,

H.-S. Lau, “The newsboy problem under alternative optimization objectives,” Journal of the Operational Research Society, vol. 31, no. 6, pp. 525–535, 1980

work page 1980
[13]

Mean–variance analysis for the newsvendor problem,

T.-M. Choi, D. Li, and H. Yan, “Mean–variance analysis for the newsvendor problem,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 38, no. 5, pp. 1169–1180, 2008

work page 2008
[14]

Supply chain risk analysis with mean-variance models: A technical review,

C.-H. Chiu and T.-M. Choi, “Supply chain risk analysis with mean-variance models: A technical review,” Annals of Operations Research, vol. 240, no. 2, pp. 489–507, 2016

work page 2016
[15]

Percentile performance criteria for limiting average Markov decision processes,

J. A. Filar, D. Krass, K. W. Ross, and S. Member, “Percentile performance criteria for limiting average Markov decision processes,” IEEE Transactions on Automatic Control, vol. 40, no. 1, pp. 2–10, 1995

work page 1995
[16]

Minimizing risk models in Markov decision process with policies depending on target values,

C. Wu and Y. Lin, “Minimizing risk models in Markov decision process with policies depending on target values,” Journal of Mathematical Analysis and Applications, vol. 23, no. 1, pp. 47–67, 1999

work page 1999
[17]

S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. Springer Science & Business Media, 2009

work page 2009
[18]

Dynamic coherent risk measures,

F. Riedel, “Dynamic coherent risk measures,” Stochastic Processes and their Applications, vol. 112, no. 2, pp. 185–200, 2004.

work page 2004
[19]

Coherent measures of risk,

P. Artzner, F. Delbaen, J. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 1–24, 1998

work page 1998
[20]

Transition-based versus state-based reward functions for MDPs with Value-at-Risk,

S. Ma and J. Y. Yu, “Transition-based versus state-based reward functions for MDPs with Value-at-Risk,” in Proceedings of the 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 974–981, 2017

work page 2017
[21]

The variance of discounted Markov decision processes,

M. J. Sobel, “The variance of discounted Markov decision processes,” Journal of Applied Probability, vol. 19, no. 4, pp. 794–802, 1982

work page 1982
[22]

State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning

S. Ma and J. Y. Yu, “State-augmentation transformations for risk-sensitive reinforcement learning,” arXiv:1804.05950v2, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Q-learning for risk-sensitive control,

V. S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of Operations Research, vol. 27, no. 2, pp. 294–311, 2002

work page 2002
[24]

Risk-sensitive reinforcement learning,

Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive reinforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298–1328, 2014

work page 2014
[25]

A comprehensive survey on safe reinforcement learning,

J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015

work page 2015
[26]

Quantile Reinforcement Learning

H. Gilbert and P. Weng, “Quantile reinforcement learning,” arXiv:1611.00862, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[27]

Risk-aware Q-learning for Markov decision processes,

W. Huang and W. B. Haskell, “Risk-aware Q-learning for Markov decision processes,” in Proceedings of the 56th IEEE Conference on Decision and Control (CDC), pp. 4928–4933, 2017

work page 2017
[28]

Risk-constrained reinforcement learning with percentile risk criteria,

Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017

work page 2017
[29]

Safe model-based reinforcement learning with stability guarantees,

F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), pp. 908–918, 2017

work page 2017
[30]

Model minimization in hierarchical reinforcement learning,

B. Ravindran and A. G. Barto, “Model minimization in hierarchical reinforcement learning,” in International Symposium on Abstraction, Reformulation, and Approximation, pp. 196–211, Springer, 2002

work page 2002
[31]

Approximation capabilities of multilayer feedforward networks,

K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991

work page 1991
[32]

Emerging techniques for enhancing the practical application of city logistics models,

E. Taniguchi, R. G. Thompson, and T. Yamada, “Emerging techniques for enhancing the practical application of city logistics models,” Procedia-Social and Behavioral Sciences, vol. 39, pp. 3–18, 2012.

work page 2012
[33]

Cohen and A

L. Cohen and A. Young, Multisourcing: Moving beyond outsourcing to achieve growth and agility. Harvard Business Press, 2006

work page 2006
[34]

A replenishment model for the supply-uncertainty problem,

E. Mohebbi, “A replenishment model for the supply-uncertainty problem,” International Journal of Production Economics, vol. 87, pp. 25–37, 2004

work page 2004
[35]

A Markov decision process-based policy characterization approach for a stochastic inventory control problem with unreliable sourcing,

S. S. Ahiska, S. R. Appaji, R. E. King, and D. P. Warsing Jr, “A Markov decision process-based policy characterization approach for a stochastic inventory control problem with unreliable sourcing,” International Journal of Production Economics, vol. 144, no. 2, pp. 485–496, 2013

work page 2013
[36]

Shen, Risk sensitive Markov decision processes

Y. Shen, Risk sensitive Markov decision processes. PhD thesis, 01 2015

work page 2015
[37]

Hadoux, Markovian sequential decision-making in non-stationary environments: application to argumentative debates

E. Hadoux, Markovian sequential decision-making in non-stationary environments: application to argumentative debates. PhD thesis, UPMC, Sorbonne Universites CNRS, 2015

work page 2015
[38]

Solving hidden-mode markov decision problems.,

S. P.-M. Choi, N. L. Zhang, and D.-Y. Yeung, “Solving hidden-mode markov decision problems.,” in AISTATS, Citeseer, 2001

work page 2001
[39]

A lyapunov-based approach to safe reinforcement learning,

Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 8092–8101, 2018

work page 2018
[40]

Risk-sensitive reinforcement learning applied to control under constraints,

P. Geibel and F. Wysotzki, “Risk-sensitive reinforcement learning applied to control under constraints,” Journal of Artificial Intelligence Research, vol. 24, pp. 81–108, 2005

work page 2005
[41]

Implicit Quantile Networks for Distributional Reinforcement Learning

W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” arXiv preprint arXiv:1806.06923, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[42]

QUOTA: The Quantile Option Architecture for Reinforcement Learning

S. Zhang, B. Mavrin, H. Yao, L. Kong, and B. Liu, “Quota: The quantile option architecture for reinforcement learning,” arXiv preprint arXiv:1811.02073, 2018.

work page internal anchor Pith review Pith/arXiv arXiv 2018
