arxiv: 2209.13050 · v1 · submitted 2022-09-26 · 🧮 math.OC

Constrained Policy Optimization for Stochastic Optimal Control under Nonstationary Uncertainties

Pith reviewed 2026-05-24 11:22 UTC · model grok-4.3

classification 🧮 math.OC
keywords stochastic optimal control · policy optimization · Markov embeddability · nonstationary uncertainties · nonlinear programming · interior-point methods · automatic differentiation

The pith

Stochastic optimal control under nonstationary uncertainties reduces to constrained policy optimization via Markov embeddability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that by assuming Markov embeddability, one can reformulate stochastic optimal control as a policy optimization problem in an augmented state space that includes the uncertainty process. This infinite-dimensional problem is then discretized into a finite nonlinear program through function approximation, deterministic sampling, and truncation in time. Solving this program with automatic differentiation and interior-point methods yields control policies, as demonstrated in a numerical example. The approach addresses systems where uncertainties change over time, which standard stationary methods cannot handle directly.
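
The reformulation described above can be written schematically as follows. This is a sketch in illustrative notation, not the paper's own: the symbols $f$, $g$, $h$, $\ell$, $c$, $z_t$, $\varepsilon_t$, $\theta$, $S$, and $T$ are assumptions introduced here, and whether the paper discounts or averages the infinite-horizon cost is not stated on this page.

```latex
% Illustrative notation only; the paper's own symbols may differ.
% Policy optimization over the augmented state (x_t, z_t):
\begin{aligned}
\min_{\pi}\quad & \mathbb{E}\!\left[\sum_{t=0}^{\infty} \ell(x_t, u_t)\right] \\
\text{s.t.}\quad & x_{t+1} = f(x_t, u_t, w_t), \\
& z_{t+1} = g(z_t, \varepsilon_t), \qquad w_t = h(z_t), \\
& u_t = \pi(x_t, z_t), \qquad c(x_t, u_t) \le 0 .
\end{aligned}

% Finite-dimensional surrogate: parametrized policy \pi_\theta,
% S fixed samples, and truncation at horizon T.
\min_{\theta}\quad \frac{1}{S}\sum_{s=1}^{S}\sum_{t=0}^{T-1}
  \ell\bigl(x_t^{(s)}, u_t^{(s)}\bigr)
\quad\text{with}\quad
u_t^{(s)} = \pi_\theta\bigl(x_t^{(s)}, z_t^{(s)}\bigr),
\qquad c\bigl(x_t^{(s)}, u_t^{(s)}\bigr) \le 0 .
```

Here $z_t$ plays the role of the embedded uncertainty state, so the policy sees both the plant state and whatever summary of the past is needed to predict $w_t$.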

Core claim

Under the Markov embeddability assumption, the stochastic optimal control problem is cast as a policy optimization problem over the augmented state space. This infinite-dimensional problem is approximated as a finite-dimensional nonlinear program by applying function approximation, deterministic sampling, and temporal truncation. The approximated problem is solved using automatic differentiation and condensed-space interior-point methods.
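
The three approximation steps can be illustrated on a toy scalar system. The sketch below is not the paper's implementation: the dynamics coefficients, the linear policy class, the sample grid, and the horizon are all hypothetical, the innovation noise is set to zero so that the only randomness is the sampled initial uncertainty, and a coarse grid search stands in for the condensed-space interior-point solver.

```python
import itertools

A, B, RHO = 0.9, 0.5, 0.8      # hypothetical plant and uncertainty coefficients
R, U_MAX, T = 0.1, 2.0, 30     # control weight, input bound, truncated horizon

# Deterministic sampling: a fixed grid of initial augmented states (x0, w0).
SAMPLES = list(itertools.product([-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5]))

def rollout(theta, x0, w0):
    """Return (truncated cost, largest |u|) for the policy u = -(kx*x + kw*w)."""
    kx, kw = theta
    x, w, cost, u_peak = x0, w0, 0.0, 0.0
    for _ in range(T):
        u = -(kx * x + kw * w)              # function approximation: linear policy
        cost += x * x + R * u * u
        u_peak = max(u_peak, abs(u))
        x, w = A * x + B * u + w, RHO * w   # augmented-state transition
    return cost, u_peak

def objective_and_violation(theta):
    """Sample-average cost and worst violation of the bound |u| <= U_MAX."""
    results = [rollout(theta, x0, w0) for x0, w0 in SAMPLES]
    mean_cost = sum(c for c, _ in results) / len(results)
    worst_violation = max(max(0.0, peak - U_MAX) for _, peak in results)
    return mean_cost, worst_violation

# Stand-in for the NLP solver: keep the cheapest feasible point on a coarse grid.
grid = [i * 0.1 for i in range(31)]
candidates = []
for kx in grid:
    for kw in grid:
        cost, violation = objective_and_violation((kx, kw))
        if violation == 0.0:
            candidates.append((cost, kx, kw))
best_cost, best_kx, best_kw = min(candidates)
```

The decision variables are only the two policy parameters, which is what makes the surrogate finite-dimensional; a real solver would treat the input bound as an explicit constraint rather than filtering grid points.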

What carries the argument

Markov embeddability assumption, which embeds the nonstationary uncertainty process into an augmented Markov state to allow policy optimization.
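
One familiar way to read this idea is the companion-form embedding of an autoregressive process. The sketch below is an interpretation, not the paper's formal definition, which is not reproduced on this page: it assumes a hypothetical AR(2) disturbance, which is not Markov in its current value alone but becomes Markov once the previous value is carried along in the state.

```python
import random

def simulate_ar2(a1, a2, noise, w0, w1):
    """Simulate w[t] = a1*w[t-1] + a2*w[t-2] + noise[t] using two past values."""
    w = [w0, w1]
    for e in noise:
        w.append(a1 * w[-1] + a2 * w[-2] + e)
    return w

def step_embedded(z, a1, a2, e):
    """One Markov step on the augmented state z = (w[t], w[t-1])."""
    w_t, w_prev = z
    return (a1 * w_t + a2 * w_prev + e, w_t)

def simulate_embedded(a1, a2, noise, w0, w1):
    """Same process, but each step reads only the current augmented state."""
    z = (w1, w0)
    w = [w0, w1]
    for e in noise:
        z = step_embedded(z, a1, a2, e)
        w.append(z[0])
    return w

rng = random.Random(0)
noise = [rng.gauss(0.0, 0.1) for _ in range(20)]
direct = simulate_ar2(0.5, 0.3, noise, 0.0, 1.0)
embedded = simulate_embedded(0.5, 0.3, noise, 0.0, 1.0)
max_gap = max(abs(p - q) for p, q in zip(direct, embedded))
```

Both simulations produce the same trajectory, but only the second exposes a state from which the next value can be predicted without looking further back, which is what a state-feedback policy needs.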

If this is right

  • The stochastic optimal control problem becomes equivalent to optimizing a policy over the augmented state space.
  • The infinite-dimensional problem is reduced to a tractable finite nonlinear program.
  • Automatic differentiation supplies the gradients needed for condensed-space interior-point solvers.
  • A numerical demonstration confirms that the resulting policy performs as intended on the example system.
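
To make the role of automatic differentiation concrete, the following toy forward-mode implementation differentiates a truncated rollout cost with respect to a single policy gain. It is a minimal sketch under stated assumptions: the `Dual` class, the scalar dynamics, and all constants are hypothetical, and the paper's actual tooling is not shown on this page.

```python
class Dual:
    """Number val + eps*dot with eps**2 = 0, used for forward-mode AD."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)

    __radd__ = __add__

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule carried alongside the value.
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.val, -self.dot)

def truncated_cost(k, x0=1.0, a=0.9, b=0.5, r=0.1, horizon=20):
    """Cost of the feedback law u = -k*x over a finite horizon."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost = cost + x * x + r * u * u
        x = a * x + b * u
    return cost

def derivative(f, k):
    """Seed the input with dot = 1 and read the derivative off the output."""
    return f(Dual(k, 1.0)).dot

grad_ad = derivative(truncated_cost, 0.8)

# Central finite difference, used only as an independent check.
h = 1e-6
grad_fd = (truncated_cost(0.8 + h) - truncated_cost(0.8 - h)) / (2 * h)
```

The same rollout code runs on plain floats and on `Dual` values, which is the practical appeal: the derivative comes from the program that defines the objective rather than from a hand-derived formula.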

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If Markov embeddability can be verified for common classes of time-varying disturbances, the reformulation would extend to many engineering control tasks.
  • The open questions on asymptotic exactness indicate that convergence rates under increasing sample size and horizon length remain to be quantified.
  • The sampling-based approximation could be replaced by quadrature rules or other deterministic integration schemes to improve accuracy.
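
As a small illustration of the last point, a three-point Gauss-Hermite rule integrates low-degree polynomials of a standard normal variable exactly. The sketch assumes a hypothetical quadratic stage cost and Gaussian noise; the Monte Carlo estimate is included only as a reference point and is not the paper's sampling scheme.

```python
import math
import random

# Three-point Gauss-Hermite rule rescaled for a standard normal variable.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)

def expect_quadrature(f):
    """Approximate E[f(w)] for w ~ N(0, 1) with three function evaluations."""
    return sum(wt * f(z) for wt, z in zip(WEIGHTS, NODES))

def expect_monte_carlo(f, n, seed=0):
    """Plain sample average, for comparison."""
    rng = random.Random(seed)
    return sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

def stage_cost(w):
    # Hypothetical quadratic penalty on the next state x = 0.5 + w.
    return (0.5 + w) ** 2

exact = 0.5 ** 2 + 1.0          # E[(0.5 + w)^2] = 0.25 + Var(w)
quad = expect_quadrature(stage_cost)
mc = expect_monte_carlo(stage_cost, 1000)
```

For this quadratic integrand the rule is exact up to rounding, while the sample average carries statistical error; for costs that are not low-degree polynomials the rule is only an approximation.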

Load-bearing premise

The nonstationary uncertainty process must satisfy Markov embeddability so that the augmented state captures the dynamics without loss of information.

What would settle it

A concrete nonstationary uncertainty process that violates Markov embeddability, for which the method produces a policy whose achieved cost differs from the true optimum by a measurable amount.

Figures

Figures reproduced from arXiv: 2209.13050 by Emil Constantinescu, François Pacaud, Mihai Anitescu, Sungho Shin.

Figure 1: Closed-loop simulation of PO, LQR, and MPC (nominal).
Figure 2: Closed-loop simulation of PO, LQR, and MPC (noisy).

Original abstract

This article presents a constrained policy optimization approach for the optimal control of systems under nonstationary uncertainties. We introduce an assumption that we call Markov embeddability that allows us to cast the stochastic optimal control problem as a policy optimization problem over the augmented state space. Then, the infinite-dimensional policy optimization problem is approximated as a finite-dimensional nonlinear program by applying function approximation, deterministic sampling, and temporal truncation. The approximated problem is solved by using automatic differentiation and condensed-space interior-point methods. We formulate several conceptual and practical open questions regarding the asymptotic exactness of the approximation and the solution strategies for the approximated problem. As a proof of concept, we provide a numerical example demonstrating the performance of the control policy obtained by the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes casting stochastic optimal control problems under nonstationary uncertainties as constrained policy optimization problems over an augmented state space, enabled by a Markov embeddability assumption. The resulting infinite-dimensional problem is approximated as a finite-dimensional nonlinear program via function approximation, deterministic sampling, and temporal truncation; the NLP is then solved using automatic differentiation and condensed-space interior-point methods. Several open questions on asymptotic exactness of the approximation are explicitly formulated, and the method is illustrated on a single numerical example.

Significance. If the open questions on asymptotic exactness were resolved with positive convergence results, the framework could offer a systematic way to apply modern nonlinear programming tools to constrained stochastic control with nonstationary uncertainty. The explicit use of automatic differentiation and interior-point methods is a practical strength, and the formulation of open questions provides a clear research agenda. At present, however, the absence of any error bounds or consistency analysis limits the result to a conceptual proposal whose practical significance remains to be demonstrated.

major comments (2)
  1. [Abstract] Abstract: the central claim that the finite NLP obtained by function approximation, deterministic sampling, and temporal truncation can be used to solve the original constrained stochastic optimal control problem rests on the asymptotic exactness of this scheme, yet the manuscript itself states that this exactness is posed as an open question with no accompanying error bounds, consistency proof, or convergence analysis supplied anywhere in the text.
  2. [Numerical example] Numerical example section: validation is limited to a single numerical example with no comparison against alternative methods for nonstationary stochastic control or against the infinite-dimensional problem, making it impossible to assess whether the interior-point solution of the approximated NLP faithfully represents the original problem.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the detailed review. The manuscript is explicitly framed as a conceptual proposal that formulates open questions on asymptotic exactness rather than claiming to resolve them. We respond point by point to the major comments below.

  1. Referee: [Abstract] Abstract: the central claim that the finite NLP obtained by function approximation, deterministic sampling, and temporal truncation can be used to solve the original constrained stochastic optimal control problem rests on the asymptotic exactness of this scheme, yet the manuscript itself states that this exactness is posed as an open question with no accompanying error bounds, consistency proof, or convergence analysis supplied anywhere in the text.

    Authors: The abstract does not advance a claim that the finite NLP solves the original problem. It describes the Markov-embeddability formulation, the approximation steps, the use of automatic differentiation and interior-point methods to solve the resulting NLP, and then states that open questions on asymptotic exactness are formulated. The contribution is therefore the casting of the problem and the practical solution procedure for the approximation, with the open questions serving as an explicit research agenda. No error bounds are supplied because their derivation is left open.

    revision: no

  2. Referee: [Numerical example] Numerical example section: validation is limited to a single numerical example with no comparison against alternative methods for nonstationary stochastic control or against the infinite-dimensional problem, making it impossible to assess whether the interior-point solution of the approximated NLP faithfully represents the original problem.

    Authors: The numerical example is presented solely as a proof of concept, consistent with the abstract wording. A single illustrative instance is appropriate for demonstrating that the overall pipeline (formulation, approximation, and solver) can be executed. Systematic comparisons to other nonstationary stochastic control methods or to the infinite-dimensional problem would require additional theoretical and computational machinery that lies outside the scope of the current conceptual contribution.

    revision: no

standing simulated objections not resolved
  • Derivation of error bounds, consistency proofs, or convergence analysis for the approximation scheme
  • Empirical comparisons against alternative methods or the infinite-dimensional formulation

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit assumption and standard methods with open questions stated.

The paper introduces Markov embeddability as a new assumption to recast the SOC problem, then applies function approximation, deterministic sampling, and temporal truncation to obtain a finite NLP solved via automatic differentiation and interior-point methods. It explicitly formulates open questions on asymptotic exactness rather than claiming convergence. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain. The central claim remains an approximation approach whose validity is left partially open, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the Markov embeddability assumption introduced in the paper.

axioms (1)
  • domain assumption: Markov embeddability
    Allows casting SOC as policy optimization over the augmented state space.
