
arxiv: 2604.06936 · v2 · submitted 2026-04-08 · 🧮 math.OC

Adaptive Distributionally Robust Optimal Control with Bayesian Ambiguity Sets

Pith reviewed 2026-05-10 17:25 UTC · model grok-4.3

classification 🧮 math.OC
keywords adaptive distributionally robust optimal control · Bayesian ambiguity sets · episodic Bayesian learning · stochastic optimal control · risk-averse reformulation · consistency guarantees · cutting-plane algorithm · inventory control

The pith

Bayesian learning from episodic data produces consistent and adaptive distributionally robust control policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive distributionally robust optimal control model whose ambiguity set is refined by Bayesian learning from data that arrives in separate episodes. This addresses the excessive conservatism of offline models when data are limited and extends applicability to settings where samples are collected episodically rather than all at once. Under moderate conditions the model admits a tractable risk-averse reformulation. The authors prove that the optimal value function and policy converge to those of the true distribution in the infinite-horizon case and supply finite-sample posterior credibility bounds on the value attained by the learned policy. They further establish stability under data perturbations and supply a convergent Bellman-operator cutting-plane algorithm.

Core claim

By updating the ambiguity set of a distributionally robust optimal control problem via Bayesian posteriors computed from episodic samples, one obtains a tractable risk-averse reformulation together with consistency of the optimal value function and optimal policy for infinite-horizon problems and finite-sample posterior credibility guarantees for the policy value; the resulting model is stable to sample perturbations and can be solved by a convergent Bellman-operator cutting-plane algorithm.

What carries the argument

The episodic Bayesian DROC model. Its ambiguity set is updated by Bayesian posterior distributions computed from successive episodes of samples, which adaptively reduces conservatism while preserving robustness.
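
A minimal Python sketch of the episodic updating idea, assuming purely for illustration a finite demand support with a Dirichlet prior, so that each episode's update is a conjugate count addition. The paper's actual posterior and ambiguity-set construction are not reproduced on this page; the function `update_dirichlet`, the distribution `true_p`, and the episode sizes are hypothetical.

```python
import numpy as np

def update_dirichlet(alpha, episode_samples):
    """Conjugate Dirichlet update: add the episode's per-bin counts to alpha."""
    counts = np.bincount(episode_samples, minlength=len(alpha))
    return alpha + counts

# Hypothetical setup: three demand levels with an unknown true distribution.
rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.3])
alpha = np.ones(3)                      # flat prior (an assumption of this sketch)

for episode in range(50):               # 50 episodes of 20 samples each
    samples = rng.choice(3, size=20, p=true_p)
    alpha = update_dirichlet(alpha, samples)

posterior_mean = alpha / alpha.sum()
# Marginal posterior variance of each component; it shrinks as episodes accumulate.
posterior_var = posterior_mean * (1 - posterior_mean) / (alpha.sum() + 1)
```

Any ambiguity set built around such a posterior (for example a credible region) would tighten as `posterior_var` decreases, which is the sense in which conservatism is reduced adaptively.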

If this is right

  • The optimal value function and policy converge to their true counterparts for infinite-horizon stochastic optimal control.
  • The policy value satisfies finite-sample posterior credibility guarantees.
  • The model remains stable and statistically robust under perturbations of the observed samples.
  • Solutions can be computed efficiently by the Bellman-operator cutting-plane algorithm with proven convergence.
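
To make the objects in these bullets concrete, here is a hedged Python sketch of a worst-case Bellman operator on a toy inventory problem, solved by plain fixed-point iteration. It is not the paper's BOCP cutting-plane algorithm, and the ambiguity set here is just a hypothetical finite list of candidate demand distributions `P_set`; the cost parameters `c`, `h`, `b` and the grid size are likewise illustrative assumptions.

```python
import numpy as np

def robust_bellman(V, P_set, gamma=0.9, c=1.0, h=0.5, b=4.0):
    """Apply a worst-case Bellman operator once on an inventory grid.

    States are stock levels 0..len(V)-1; an action raises stock to y >= s.
    P_set has shape (K, D): K candidate demand pmfs over demands 0..D-1.
    """
    max_inv = len(V) - 1
    d = np.arange(P_set.shape[1])
    TV = np.empty(max_inv + 1)
    policy = np.empty(max_inv + 1, dtype=int)
    for s in range(max_inv + 1):
        best_cost, best_y = np.inf, s
        for y in range(s, max_inv + 1):
            left = np.maximum(y - d, 0)                        # next stock level
            stage = c * (y - s) + h * left + b * np.maximum(d - y, 0)
            worst = np.max(P_set @ (stage + gamma * V[left]))  # sup over candidates
            if worst < best_cost:
                best_cost, best_y = worst, y
        TV[s], policy[s] = best_cost, best_y
    return TV, policy

def solve(P_set, max_inv=5, gamma=0.9, tol=1e-9):
    """Iterate V <- T V until the sup-norm change falls below tol."""
    V = np.zeros(max_inv + 1)
    while True:
        TV, policy = robust_bellman(V, P_set, gamma)
        if np.max(np.abs(TV - V)) < tol:
            return TV, policy
        V = TV

# Hypothetical ambiguity set: three candidate demand pmfs on {0, 1, 2, 3}.
P_set = np.array([[0.10, 0.40, 0.40, 0.10],
                  [0.20, 0.30, 0.30, 0.20],
                  [0.05, 0.25, 0.40, 0.30]])
V_robust, order_up_to = solve(P_set)
```

Shrinking `P_set` toward a single distribution, as a concentrating posterior would, moves `V_robust` toward the ordinary value function under that distribution.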

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Bayesian updating mechanism could be applied to other sequential decision problems where data arrives in batches, such as certain reinforcement-learning tasks under distributional uncertainty.
  • The finite-sample credibility bounds give a practical way to decide when enough episodes have been observed for the policy to be deployed with quantified reliability.
  • Testing the moderate conditions on concrete problem structures would clarify how large the data requirement is in specific applications.
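
One way to picture the deployment decision in the second bullet is a Monte Carlo posterior quantile of a fixed policy's cost. The Python sketch below is an illustration under assumed ingredients (a Dirichlet posterior `alpha`, an order-up-to policy, toy cost parameters), not the paper's finite-sample credibility bound; `policy_value` and `credible_upper_bound` are hypothetical names.

```python
import numpy as np

def policy_value(p, base_stock, gamma=0.9, max_inv=5, c=1.0, h=0.5, b=4.0):
    """Exact discounted cost of an order-up-to policy under demand pmf p."""
    n = max_inv + 1
    d = np.arange(len(p))
    P = np.zeros((n, n))
    cost = np.zeros(n)
    for s in range(n):
        y = max(s, base_stock)                   # stock after ordering
        left = np.maximum(y - d, 0)              # next stock for each demand
        cost[s] = p @ (c * (y - s) + h * left + b * np.maximum(d - y, 0))
        np.add.at(P[s], left, p)                 # accumulate transition probabilities
    # Solve (I - gamma * P) V = cost for the policy's value function.
    return np.linalg.solve(np.eye(n) - gamma * P, cost)

def credible_upper_bound(alpha, base_stock, level=0.95, n_draws=500, seed=0, s0=0):
    """Posterior quantile of the policy cost at state s0 over Dirichlet draws."""
    rng = np.random.default_rng(seed)
    draws = rng.dirichlet(alpha, size=n_draws)
    values = np.array([policy_value(p, base_stock)[s0] for p in draws])
    return np.quantile(values, level), values

# Hypothetical posterior after a few episodes on demands {0, 1, 2, 3}.
alpha = np.array([5.0, 12.0, 8.0, 3.0])
upper, values = credible_upper_bound(alpha, base_stock=3)
```

A practitioner could rerun this after each episode and deploy once `upper` falls below an acceptable cost level; how that heuristic relates to the paper's stated bound is not established here.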

Load-bearing premise

The moderate conditions that permit the tractable risk-averse reformulation, the consistency proofs, and the credibility bounds actually hold, and the samples are generated episodically from the true underlying distribution.

What would settle it

Numerical simulations in which the policy value computed from the episodic Bayesian model fails to approach the policy value obtained under the true distribution as the number of episodes grows to infinity would falsify the consistency claim.
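
A hedged Python sketch of what such a simulation could look like on a toy inventory problem. As a stand-in for the episodic Bayesian DROC model it uses a plug-in value function under the Dirichlet posterior mean, which is an assumption of this sketch and not the paper's model; the names `value_function` and `consistency_gaps`, the cost parameters, and the demand distribution are all hypothetical.

```python
import numpy as np

def value_function(p, gamma=0.9, max_inv=5, c=1.0, h=0.5, b=4.0, tol=1e-8):
    """Value iteration for a toy inventory problem under demand pmf p."""
    d = np.arange(len(p))
    V = np.zeros(max_inv + 1)
    while True:
        TV = np.empty_like(V)
        for s in range(max_inv + 1):
            costs = []
            for y in range(s, max_inv + 1):      # y = stock after ordering
                left = np.maximum(y - d, 0)
                stage = c * (y - s) + h * left + b * np.maximum(d - y, 0)
                costs.append(p @ (stage + gamma * V[left]))
            TV[s] = min(costs)
        if np.max(np.abs(TV - V)) < tol:
            return TV
        V = TV

def consistency_gaps(true_p, n_episodes, per_episode, seed=0):
    """Sup-norm gap to the true value function after each episode."""
    rng = np.random.default_rng(seed)
    V_true = value_function(true_p)
    alpha = np.ones(len(true_p))                 # flat prior (assumption)
    gaps = []
    for _ in range(n_episodes):
        samples = rng.choice(len(true_p), size=per_episode, p=true_p)
        alpha += np.bincount(samples, minlength=len(true_p))
        V_hat = value_function(alpha / alpha.sum())
        gaps.append(np.max(np.abs(V_hat - V_true)))
    return np.array(gaps)

true_p = np.array([0.1, 0.4, 0.4, 0.1])
gaps = consistency_gaps(true_p, n_episodes=20, per_episode=20)
```

If the recorded gaps for the actual model plateaued at a positive level as the number of episodes grew, that would be the kind of evidence against consistency described above.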

Figures

Figures reproduced from arXiv: 2604.06936 by Enlu Zhou, Huifu Xu, Wentao Ma, Zhiping Chen.

Figure 1: Convergence of the integrated gap of the optimal value function across episodes.
Figure 2: Quantitative statistical robustness experiment at
Figure 3: BOCP warm-start vs. cold-start for the episodic Bayesian DROC.
Figure 4: Out-of-sample discounted cost under the true environment versus episode index
Figure 5: Out-of-sample discounted cost under the contaminated environment versus episode index

Original abstract

In stochastic optimal control (SOC), uncertainty may arise from incomplete knowledge of the true probability distribution of the underlying environment, which is known as Knightian or epistemic uncertainty. Distributionally robust optimal control (DROC) models are subsequently proposed to tackle this source of uncertainty. While such models are effective in some practical applications, most existing DROC models are offline and can be overly conservative when data are scarce. Moreover, they cannot be applied to the case when samples are generated episodically. Motivated by the Bayesian SOC framework recently proposed by Shapiro et al.~\cite{shapiro2025episodic}, we propose an adaptive DROC model in which the ambiguity set is updated via Bayesian learning from new data. Under some moderate conditions, we derive a tractable risk-averse reformulation, establish consistency of the optimal value function and optimal policy for an infinite-horizon SOC and establish a finite-sample posterior credibility guarantee for the policy value induced by the proposed episodic Bayesian DROC model. We also study the stability and statistical robustness of the proposed model with respect to sample perturbations that often arise in data-driven environments. To solve the episodic Bayesian DROC model, we propose a Bellman-operator cutting-plane (BOCP) algorithm that is computationally efficient and provably convergent. Numerical results on an inventory control problem demonstrate the effectiveness, adaptivity, and robust performance of the proposed model and algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an adaptive distributionally robust optimal control (DROC) framework for stochastic optimal control (SOC) under epistemic uncertainty. The ambiguity set is updated via Bayesian learning from episodic data samples. Under unspecified moderate conditions, the authors derive a tractable risk-averse reformulation, establish consistency of the optimal value function and policy for infinite-horizon problems, provide a finite-sample posterior credibility guarantee, analyze stability and robustness to sample perturbations, and introduce a provably convergent Bellman-operator cutting-plane (BOCP) algorithm. Numerical validation is provided on an inventory control example.

Significance. If the moderate conditions hold and the proofs are complete, the work meaningfully extends offline DROC by enabling online Bayesian adaptation, reducing conservatism with data while retaining robustness guarantees. The consistency and credibility results, combined with the convergent BOCP algorithm, offer both theoretical and computational contributions to data-driven control. The inventory example illustrates practical relevance in operations research settings. The integration of Bayesian updating with distributionally robust control is a clear strength when the assumptions align with the application.

major comments (2)
  1. [Abstract] Abstract: All three central claims (tractable risk-averse reformulation, consistency of value function/policy for infinite-horizon SOC, and finite-sample posterior credibility guarantee) are stated to hold only 'under some moderate conditions,' yet these conditions are never enumerated or characterized. This is load-bearing because the conditions control whether the Bayesian update remains tractable, whether the Bellman operator is a contraction, and whether the credibility bound applies; without an explicit list (e.g., requirements on the ambiguity-set family, uniform integrability, or moment conditions), the scope and practical utility of the results cannot be assessed.
  2. [Abstract and model formulation] Episodic sampling assumption (referenced in abstract and likely §3): The framework requires that samples are generated episodically from the true underlying distribution for the posterior update and finite-sample guarantee to be well-defined. If this assumption is violated (common in non-stationary or biased data environments), both the consistency result and the credibility guarantee may fail to hold, narrowing the applicability of the adaptive DROC model.
minor comments (2)
  1. [Abstract] Abstract: The reference to Shapiro et al. is cited as shapiro2025episodic; ensure the full bibliographic entry is provided in the reference list and that any dependence on prior results is clearly delineated.
  2. [Numerical results] Numerical section: The inventory control example demonstrates effectiveness, but additional details on how the moderate conditions are satisfied in the example (e.g., specific distribution family or integrability) would strengthen the link between theory and numerics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major comment point by point below, indicating planned revisions to improve clarity and scope without misrepresenting the contributions.

  1. Referee: [Abstract] Abstract: All three central claims (tractable risk-averse reformulation, consistency of value function/policy for infinite-horizon SOC, and finite-sample posterior credibility guarantee) are stated to hold only 'under some moderate conditions,' yet these conditions are never enumerated or characterized. This is load-bearing because the conditions control whether the Bayesian update remains tractable, whether the Bellman operator is a contraction, and whether the credibility bound applies; without an explicit list (e.g., requirements on the ambiguity-set family, uniform integrability, or moment conditions), the scope and practical utility of the results cannot be assessed.

    Authors: We agree that the abstract would benefit from an explicit enumeration of the moderate conditions to immediately convey scope. These conditions are fully specified in the manuscript: Assumption 2.1 requires the ambiguity set to be a weakly compact, convex collection of measures with uniformly bounded first moments; Assumption 3.2 imposes uniform integrability and Lipschitz continuity on the stage costs; and Assumption 4.1 ensures the prior has full support with the posterior concentrating under i.i.d. episodic sampling. The Bellman operator is shown to be a contraction under a discount factor strictly less than one combined with the moment bounds. We will revise the abstract to include a concise parenthetical list of these conditions and add a short summary paragraph at the end of the introduction that cross-references their locations. This change directly addresses the concern while preserving the original claims. revision: yes

  2. Referee: [Abstract and model formulation] Episodic sampling assumption (referenced in abstract and likely §3): The framework requires that samples are generated episodically from the true underlying distribution for the posterior update and finite-sample guarantee to be well-defined. If this assumption is violated (common in non-stationary or biased data environments), both the consistency result and the credibility guarantee may fail to hold, narrowing the applicability of the adaptive DROC model.

    Authors: The episodic i.i.d. sampling assumption is indeed foundational, as it underpins both the Bayesian posterior update (Section 3) and the finite-sample credibility bound (Theorem 4.3), which rely on independent episodes drawn from the true distribution. We acknowledge that the consistency and guarantee results do not automatically extend to non-stationary or biased sampling regimes. In the revised manuscript we will add an explicit paragraph in the introduction and a dedicated limitations subsection in the conclusion that states the assumption, illustrates its role via the inventory example, and outlines future extensions such as sliding-window posteriors or ambiguity-set inflation for non-stationarity. This clarifies applicability without weakening the core episodic setting, which remains relevant for many operations-research control problems. revision: partial
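
For readers who want the contraction property mentioned in the first response spelled out, the following LaTeX sketch gives the standard argument in the simplest setting. It assumes a bounded stage cost and the sup norm, which is stronger than the moment bounds the authors cite; the symbols $\mathcal{T}$, $A(s)$, $\mathcal{P}$, $c$, $f$, and $\gamma$ are illustrative notation and are not taken from the manuscript.

```latex
% Assumed setting: bounded stage cost c, discount factor \gamma \in (0,1),
% ambiguity set \mathcal{P}, state transition f. Notation is illustrative.
\[
(\mathcal{T}V)(s) \;=\; \min_{a \in A(s)} \, \sup_{P \in \mathcal{P}} \,
  \mathbb{E}_{\xi \sim P}\bigl[\, c(s,a,\xi) + \gamma\, V\bigl(f(s,a,\xi)\bigr) \bigr]
\]
% Both min over a and sup over P are nonexpansive, so for bounded V and W:
\[
\bigl| (\mathcal{T}V)(s) - (\mathcal{T}W)(s) \bigr|
  \;\le\; \sup_{a \in A(s)} \, \sup_{P \in \mathcal{P}} \,
     \gamma\, \mathbb{E}_{\xi \sim P}\bigl| V\bigl(f(s,a,\xi)\bigr) - W\bigl(f(s,a,\xi)\bigr) \bigr|
  \;\le\; \gamma\, \| V - W \|_{\infty}
\]
```

Taking the supremum over $s$ gives $\|\mathcal{T}V - \mathcal{T}W\|_\infty \le \gamma \|V - W\|_\infty$, so in this simplified setting a unique fixed point exists by the Banach fixed-point theorem. Unbounded costs would typically call for a weighted norm in place of the sup norm.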

Circularity Check

0 steps flagged

No significant circularity; central results build on external Bayesian SOC framework without self-referential reduction


The derivation chain starts from the external Bayesian SOC framework of Shapiro et al. (cited as motivation) and applies standard Bayesian updating to construct ambiguity sets for DROC. Tractable risk-averse reformulation, infinite-horizon consistency of value function/policy, and finite-sample posterior credibility guarantees are all stated to hold only under unspecified moderate conditions, but these are not shown to reduce by the paper's own equations to fitted inputs or self-citations. The BOCP algorithm is a proposed solver with claimed convergence, independent of the guarantees. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear; the framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on Bayesian updating of ambiguity sets and standard assumptions from stochastic optimal control; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Moderate conditions on the ambiguity set, data generation process, and problem structure
    Invoked to derive the tractable risk-averse reformulation and to establish consistency and credibility guarantees.

pith-pipeline@v0.9.0 · 5552 in / 1247 out tokens · 50447 ms · 2026-05-10T17:25:58.450824+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages

  1. [1]

    and the monograph treatments in [26, 5]. 3.2 Asymptotic convergence of value function and optimal policy. Recall that (2.11) in Section 2.2 assumes the existence of a solution V̂*_N to the Bellman equation (2.10). However, the underlying rationale has not yet been fully established. In the following, we first demonstrate the existence and uniqueness of V̂...

  2. [2]

    The normalized bin probabilities are p_j = (F(u_j) − F(u_{j−1})) / F(U). Using the closed-form solution (6.1), we approximate the benchmark value function V* with 10^5 samples. In episode N, we update the Bayesian posterior by (2.3) using the observed data and construct the ambiguity set (2.5) corresponding to the posterior distribution. We then compute the episode...

  3. [3]

    M. Abeille and A. Lazaric. Improved regret bounds for Thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9. PMLR, 2018

  4. [4]

    C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2006

  5. [5]

    J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 2013

  6. [6]

    D. Bertsekas. Dynamic Programming and Optimal Control: Volume I, volume 4. Athena Scientific, 2012

  7. [7]

    D. Bertsekas. Abstract Dynamic Programming. Athena Scientific, 2022

  8. [8]

    D. Bertsekas and S. E. Shreve. Stochastic Optimal Control: The Discrete-time Case, volume 5. Athena Scientific, 1996

  9. [9]

    D. Bertsimas, V. Gupta, and N. Kallus. Robust sample average approximation. Mathematical Programming, 171:217–282, 2018

  10. [10]

    P. Carpentier, J.-P. Chancelier, G. Cohen, M. De Lara, and P. Girardeau. Dynamic consistency for stochastic optimal control problems. Annals of Operations Research, 200:247–263, 2012

  11. [11]

    C. Castaing and M. Valadier. Convex Analysis and Measurable Multifunctions. Springer, 1977

  12. [12]

    Z. Chen and W. Ma. A Bayesian approach to data-driven multi-stage stochastic optimization. Journal of Global Optimization, pages 1–28, 2024

  13. [13]

    Z. Chen, W. Ma, and B. Ji. Data-driven approximation of distributionally robust chance constraints using Bayesian credible intervals. OR Spectrum, 47(3):969–1009, 2025

  14. [14]

    W. L. Cooper and B. Rangarajan. Performance guarantees for empirical Markov decision processes with applications to multiperiod inventory models. Operations Research, 60(5):1267–1281, 2012

  15. [15]

    E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010

  16. [16]

    A. Dibiasi and D. Iselin. Measuring Knightian uncertainty. Empirical Economics, 61(4):2113–2141, 2021

  17. [17]

    B. Efron and T. Hastie. Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science, volume 6. Cambridge University Press, 2021

  18. [18]

    C. Füllner and S. Rebennack. Stochastic dual dynamic programming and its variants: A review. SIAM Review, 67(3):415–539, 2025

  19. [19]

    R. Gao. Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Operations Research, 71(6):2291–2306, 2023

  20. [20]

    A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 1995

  21. [21]

    V. Guigues, A. Shapiro, and Y. Cheng. Risk-averse stochastic optimal control: an efficiently computable statistical upper bound. Operations Research Letters, 51(4):393–400, 2023

  22. [22]

    S. Guo and H. Xu. Distributionally robust shortfall risk optimization model and its approximation. Mathematical Programming, 174(1):473–498, 2019

  23. [23]

    S. Guo and H. Xu. Statistical robustness in utility preference robust optimization models. Mathematical Programming, 190(1):679–720, 2021

  24. [24]

    V. Gupta. Near-optimal Bayesian ambiguity sets for distributionally robust optimization. Management Science, 65(9):4242–4260, 2019

  25. [25]

    F. R. Hampel. A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6):1887–1896, 1971

  26. [26]

    G. A. Hanasusanto, V. Roitch, D. Kuhn, and W. Wiesemann. A distributionally robust perspective on uncertainty quantification and chance constrained programming. Mathematical Programming, 151(1):35–62, 2015

  27. [27]

    W. B. Haskell, R. Jain, and D. Kalathil. Empirical dynamic programming. Mathematics of Operations Research, 41(2):402–429, 2016

  28. [28]

    O. Hernández-Lerma and J. B. Lasserre. Further Topics on Discrete-time Markov Control Processes, volume 42. Springer Science & Business Media, 2012

  29. [29]

    J. Huang, K. Zhou, and Y. Guan. A study of distributionally robust multistage stochastic optimization. arXiv preprint arXiv:1708.07930, 2017

  30. [30]

    P. J. Huber and E. M. Ronchetti. Robust Statistics. John Wiley & Sons, 2011

  31. [31]

    R. Jiang and Y. Guan. Risk-averse two-stage stochastic program with distributional ambiguity. Operations Research, 66(5):1390–1405, 2018

  32. [32]

    P. Kern, A. Simroth, and H. Zähle. First-order sensitivity of the optimal value in a Markov decision model with respect to deviations in the transition probability function. Mathematical Methods of Operations Research, 92(1):165–197, 2020

  33. [33]

    K. Kim and I. Yang. Distributional robustness in minimax linear quadratic control with Wasserstein distance. SIAM Journal on Control and Optimization, 61(2):458–483, 2023

  34. [34]

    H. Lam. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105, 2019

  35. [35]

    M. Li, X. Tong, and H. Sun. Discretization and quantification for distributionally robust optimization with decision-dependent ambiguity sets. Optimization Methods and Software, pages 1–30, 2024

  36. [36]

    P. Li, M. Yang, and Q. Wu. Confidence interval based distributionally robust real-time economic dispatch approach considering wind power accommodation risk. IEEE Transactions on Sustainable Energy, 12(1):58–69, 2020

  37. [37]

    Y. Li, Y. Lin, E. Zhou, and F. Zhang. Risk-aware model predictive control enabled by Bayesian learning. In 2022 American Control Conference (ACC), pages 108–113. IEEE, 2022

  38. [38]

    Y. Li, Y. Lin, E. Zhou, and F. Zhang. Bayesian risk-averse model predictive control with consistency and stability guarantees. arXiv preprint arXiv:2511.21871, 2025

  39. [39]

    Y. Lin, Y. Ren, and E. Zhou. Bayesian risk Markov decision processes. Advances in Neural Information Processing Systems, 35:17430–17442, 2022

  40. [40]

    Y. Liu and H. Xu. Stability analysis of stochastic programs with second order dominance constraints. Mathematical Programming, 142:435–460, 2013

  41. [41]

    W. Ma, Z. Chen, and X. Chen. Bayesian distributionally robust variational inequalities: regularization and quantification. arXiv preprint arXiv:2509.16537, 2025

  42. [42]

    W. Ma, Z. Chen, and H. Xu. A Bayesian composite risk approach for stochastic optimal control and Markov decision processes. arXiv preprint arXiv:2412.16488, 2024

  43. [43]

    S. Mehrotra and H. Zhang. Models and algorithms for distributionally robust least squares problems. Mathematical Programming, 146(1):123–141, 2014

  44. [44]

    P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018

  45. [45]

    A. Nilim and L. El Ghaoui. Robust Markov decision processes with uncertain transition matrices. PhD thesis, University of California, Berkeley, 2004

  46. [46]

    I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, 26, 2013

  47. [47]

    L. Pfeiffer. Two approaches to stochastic optimal control problems with a final-time expectation constraint. Applied Mathematics & Optimization, 77:377–404, 2018

  48. [48]

    G. C. Pflug and A. Pichler. Multistage Stochastic Optimization, volume 1104. Springer, 2014

  49. [49]

    A. B. Philpott, V. L. de Matos, and L. Kapelevich. Distributionally robust SDDP. Computational Management Science, 15:431–454, 2018

  50. [50]

    A. B. Philpott and Z. Guan. On the convergence of stochastic dual dynamic programming and related methods. Operations Research Letters, 36(4):450–455, 2008

  51. [51]

    A. Pichler and H. Xu. Quantitative stability analysis for minimax distributionally robust risk optimization. Mathematical Programming, 191(1):47–77, 2022

  52. [52]

    M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

  53. [53]

    H. Rahimian, G. Bayraksan, and T. H. De-Mello. Effective scenarios in multistage distributionally robust optimization with a focus on total variation distance. SIAM Journal on Optimization, 32(3):1698–1727, 2022

  54. [54]

    U. Rieder. Bayesian dynamic programming. Advances in Applied Probability, 7(2):330–348, 1975

  55. [55]

    R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015

  56. [56]

    R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000

  57. [57]

    W. Römisch. Stability of stochastic programming problems. In Handbooks in Operations Research and Management Science, volume 10, pages 483–554. Elsevier, 2003

  58. [58]

    A. Shapiro. Minimax and risk averse multistage stochastic programming. European Journal of Operational Research, 219(3):719–726, 2012

  59. [59]

    A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2021

  60. [60]

    A. Shapiro, E. Zhou, Y. Lin, and Y. Wang. Episodic Bayesian optimal control with unknown randomness distributions. Operations Research, 2025

  61. [61]

    M. Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, volume 2000, pages 943–950, 2000

  62. [62]

    B. Taskesen, D. Iancu, Ç. Koçyiğit, and D. Kuhn. Distributionally robust linear quadratic control. Advances in Neural Information Processing Systems, 36:18613–18632, 2023

  63. [63]

    I. Tzortzis, C. D. Charalambous, and T. Charalambous. Infinite horizon average cost dynamic programming subject to total variation distance ambiguity. SIAM Journal on Control and Optimization, 57(4):2843–2872, 2019

  64. [64]

    A. W. Van Der Vaart and J. A. Wellner. Weak convergence. In Weak convergence and empirical processes: with applications to statistics, pages 16–28. Springer, 1996

  65. [65]

    B. P. Van Parys, D. Kuhn, P. J. Goulart, and M. Morari. Distributionally robust control of constrained stochastic systems. IEEE Transactions on Automatic Control, 61(2):430–442, 2015

  66. [66]

    H. Wang, L. He, R. Gao, and F. Calmon. Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions. Advances in Neural Information Processing Systems, 36, 2024

  67. [67]

    Y. Wang and E. Zhou. Bayesian risk-averse Q-learning with streaming observations. Advances in Neural Information Processing Systems, 36:75967–75992, 2023

  68. [68]

    Z. Wang, P. W. Glynn, and Y. Ye. Likelihood robust optimization for data-driven problems. Computational Management Science, 13:241–261, 2016

  69. [69]

    J. Wessels. Markov programming by successive approximations with respect to weighted supremum norms. Journal of Mathematical Analysis and Applications, 58(2):326–335, 1977

  70. [70]

    W. Xie, C. Li, Y. Wu, and P. Zhang. A nonparametric Bayesian framework for uncertainty quantification in stochastic simulation. SIAM/ASA Journal on Uncertainty Quantification, 9(4):1527–1552, 2021

  71. [71]

    H. Xu and S. Mannor. Distributionally robust Markov decision processes. Advances in Neural Information Processing Systems, 23, 2010

  72. [72]

    H. Xu and S. Zhang. Quantitative statistical robustness in distributionally robust optimization models. Pacific Journal of Optimization Special Issue, 2021

  73. [73]

    I. Yang. Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Transactions on Automatic Control, 66(8):3863–3870, 2020

  74. [74]

    Z. Yang, Z. Chen, and H. Xu. Stability analysis of an integrated multistage stochastic programming and Markov decision process problem. arXiv preprint arXiv:2509.22194, 2025