Constrained Policy Optimization for Stochastic Optimal Control under Nonstationary Uncertainties
Pith reviewed 2026-05-24 11:22 UTC · model grok-4.3
The pith
Stochastic optimal control under nonstationary uncertainties reduces to constrained policy optimization via Markov embeddability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the Markov embeddability assumption, the stochastic optimal control problem is cast as a policy optimization problem over the augmented state space. This infinite-dimensional problem is approximated as a finite-dimensional nonlinear program by applying function approximation, deterministic sampling, and temporal truncation. The approximated problem is solved using automatic differentiation and condensed-space interior-point methods.
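As a sketch only, the two problems can be written side by side. All symbols here are assumed for illustration and are not taken from the paper: augmented state s_t, stage cost \ell, dynamics f, constraint function g, discount factor \gamma in (0, 1], policy parameters \theta, sample count M, and truncation horizon T.

```latex
\begin{aligned}
\text{(policy optimization)}\quad
& \min_{\pi}\ \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,\ell\bigl(s_t, \pi(s_t)\bigr)\right]
\quad \text{s.t.}\quad s_{t+1} = f\bigl(s_t, \pi(s_t), w_t\bigr),\quad
g\bigl(s_t, \pi(s_t)\bigr) \le 0, \\[4pt]
\text{(finite NLP)}\quad
& \min_{\theta}\ \frac{1}{M}\sum_{m=1}^{M}\sum_{t=0}^{T-1} \gamma^{t}\,
\ell\bigl(s_t^{(m)}, \pi_\theta(s_t^{(m)})\bigr)
\quad \text{s.t.}\quad s_{t+1}^{(m)} = f\bigl(s_t^{(m)}, \pi_\theta(s_t^{(m)}), w_t^{(m)}\bigr),\quad
g\bigl(s_t^{(m)}, \pi_\theta(s_t^{(m)})\bigr) \le 0 .
\end{aligned}
```

Each of the three approximation steps corresponds to one substitution: function approximation replaces \pi with \pi_\theta, sampling replaces the expectation with an average over M scenarios, and temporal truncation replaces the infinite sum with T stages.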
What carries the argument
Markov embeddability assumption, which embeds the nonstationary uncertainty process into an augmented Markov state to allow policy optimization.
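A standard illustration of this kind of embedding, offered as an assumed example rather than the paper's definition: a second-order autoregressive disturbance is not Markov in its current value alone, but it becomes Markov once the previous value is appended to the state. The function names and coefficients below are hypothetical.

```python
import numpy as np

def simulate_ar2(a1, a2, noise, xi0, xi1):
    """Direct recursion: xi[t+1] = a1*xi[t] + a2*xi[t-1] + noise[t]."""
    xi = [xi0, xi1]
    for w in noise:
        xi.append(a1 * xi[-1] + a2 * xi[-2] + w)
    return np.array(xi)

def simulate_embedded(a1, a2, noise, xi0, xi1):
    """Same process as a first-order recursion on s = (xi[t], xi[t-1])."""
    A = np.array([[a1, a2], [1.0, 0.0]])  # transition of the augmented state
    B = np.array([1.0, 0.0])              # noise enters the first component only
    s = np.array([xi1, xi0])
    out = [xi0, xi1]
    for w in noise:
        s = A @ s + B * w                 # next state depends on s and w only
        out.append(s[0])
    return np.array(out)

rng = np.random.default_rng(0)
noise = rng.normal(size=50)
direct = simulate_ar2(0.6, 0.3, noise, 0.0, 1.0)
embedded = simulate_embedded(0.6, 0.3, noise, 0.0, 1.0)
```

Both functions generate the same trajectory; the second makes explicit that a policy reading the augmented state s has all the information needed to predict the disturbance one step ahead.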
If this is right
- The stochastic optimal control problem becomes equivalent to optimizing a policy over the augmented state space.
- The infinite-dimensional problem is reduced to a tractable finite nonlinear program.
- Automatic differentiation supplies the derivatives needed by condensed-space interior-point solvers.
- A numerical example illustrates, as a proof of concept, the performance of the resulting policy on one example system.
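The reduction to a finite problem can be sketched on a toy system. Everything below is an assumption made for illustration: the scalar dynamics, the AR(1) disturbance, the linear policy class, the cost weights, and the use of a fixed pseudo-random scenario set as a stand-in for the paper's deterministic sampling scheme, whose details are not given in this summary. Constraints are omitted for brevity.

```python
import numpy as np

def truncated_sampled_cost(theta, scenarios, a=1.2, b=1.0, rho=0.8, r=0.1, x0=1.0):
    """Average truncated cost of the linear policy u = theta[0]*x + theta[1]*xi.

    The augmented state is (x, xi): plant state plus current disturbance.
    """
    total = 0.0
    for w in scenarios:                 # sampling: a fixed, finite scenario set
        x, xi = x0, 0.0
        for w_t in w:                   # temporal truncation: T = len(w) stages
            u = theta[0] * x + theta[1] * xi   # function approximation of the policy
            total += x**2 + r * u**2
            x = a * x + b * u + xi      # plant update
            xi = rho * xi + w_t         # disturbance update
    return total / len(scenarios)

rng = np.random.default_rng(1)
scenarios = rng.normal(scale=0.1, size=(20, 30))   # M = 20 scenarios, T = 30 stages

cost_no_control = truncated_sampled_cost(np.array([0.0, 0.0]), scenarios)
cost_feedback = truncated_sampled_cost(np.array([-1.0, -0.5]), scenarios)
```

With the scenarios fixed, the cost is an ordinary deterministic function of the two parameters in theta, which is what makes it suitable for a nonlinear programming solver. The open-loop plant here is unstable, so the stabilizing feedback gives a much lower cost than zero control.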
Where Pith is reading between the lines
- If Markov embeddability can be verified for common classes of time-varying disturbances, the reformulation would extend to many engineering control tasks.
- The open questions on asymptotic exactness indicate that convergence rates under increasing sample size and horizon length remain to be quantified.
- The sampling-based approximation could be replaced by quadrature rules or other deterministic integration schemes to improve accuracy.
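To make the quadrature suggestion concrete, here is a minimal sketch that approximates an expectation under a standard normal disturbance with a Gauss-Hermite rule and, for contrast, with a small random sample. The Gaussian assumption and the test function are chosen for illustration and do not come from the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gauss_hermite_expectation(g, order):
    """Approximate E[g(Z)] for Z ~ N(0, 1) with a Gauss-Hermite rule."""
    nodes, weights = hermegauss(order)   # weight function exp(-z**2 / 2)
    return float(np.sum(weights * g(nodes)) / np.sqrt(2.0 * np.pi))

def monte_carlo_expectation(g, n_samples, seed=0):
    """Approximate E[g(Z)] with a plain random sample of the same size."""
    rng = np.random.default_rng(seed)
    return float(np.mean(g(rng.normal(size=n_samples))))

def fourth_moment(z):
    return z**4                          # exact value: E[Z^4] = 3

quad_estimate = gauss_hermite_expectation(fourth_moment, order=5)
mc_estimate = monte_carlo_expectation(fourth_moment, n_samples=5)
```

A 5-point rule integrates polynomials up to degree 9 exactly, so quad_estimate matches the true value up to rounding, while a 5-sample Monte Carlo estimate generally does not. For non-polynomial costs and high-dimensional disturbances the comparison is less clear-cut.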
Load-bearing premise
The nonstationary uncertainty process must satisfy Markov embeddability so that the augmented state captures the dynamics without loss of information.
What would settle it
A concrete nonstationary uncertainty process that violates Markov embeddability, for which the method produces a policy whose achieved cost differs from the true optimum by a measurable amount.
Original abstract
This article presents a constrained policy optimization approach for the optimal control of systems under nonstationary uncertainties. We introduce an assumption that we call Markov embeddability that allows us to cast the stochastic optimal control problem as a policy optimization problem over the augmented state space. Then, the infinite-dimensional policy optimization problem is approximated as a finite-dimensional nonlinear program by applying function approximation, deterministic sampling, and temporal truncation. The approximated problem is solved by using automatic differentiation and condensed-space interior-point methods. We formulate several conceptual and practical open questions regarding the asymptotic exactness of the approximation and the solution strategies for the approximated problem. As a proof of concept, we provide a numerical example demonstrating the performance of the control policy obtained by the proposed method.
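The role of automatic differentiation can be illustrated with a small forward-mode implementation based on dual numbers. This is a teaching sketch under assumed toy dynamics and cost, not the tooling used in the paper; the Dual class and the rollout function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Number of the form val + der * eps, with eps**2 = 0."""
    val: float
    der: float

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule carried alongside the value.
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def derivative(func, x):
    """Derivative of a scalar function at x via one forward pass."""
    return func(Dual(float(x), 1.0)).der

def rollout_cost(k, a=1.2, b=1.0, x0=1.0, horizon=3):
    """Truncated cost of the feedback u = k * x on x_next = a*x + b*u."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = k * x
        cost = cost + x * x + 0.1 * (u * u)
        x = a * x + b * u
    return cost

grad_at_k = derivative(rollout_cost, -1.0)
```

The same rollout code runs on plain floats and on Dual inputs, so the derivative of the truncated cost with respect to the policy gain is obtained without deriving it by hand. Production tools extend this idea to many parameters and to second derivatives.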
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes casting stochastic optimal control problems under nonstationary uncertainties as constrained policy optimization problems over an augmented state space, enabled by a Markov embeddability assumption. The resulting infinite-dimensional problem is approximated as a finite-dimensional nonlinear program via function approximation, deterministic sampling, and temporal truncation; the NLP is then solved using automatic differentiation and condensed-space interior-point methods. Several open questions on asymptotic exactness of the approximation are explicitly formulated, and the method is illustrated on a single numerical example.
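For readers unfamiliar with interior-point methods, the basic idea can be shown on a one-variable problem. The sketch below is a plain primal log-barrier method with Newton steps, which is a simpler relative of the condensed-space primal-dual scheme the paper refers to and is not that scheme; the problem and all parameter values are assumptions.

```python
def barrier_solve(x=0.0, mu=1.0, shrink=0.2, outer_iters=12, newton_iters=20):
    """Minimize (x - 2)**2 subject to x <= 1 with a primal log-barrier method.

    Each outer iteration minimizes (x - 2)**2 - mu * log(1 - x) by Newton's
    method, then reduces the barrier parameter mu.
    """
    for _ in range(outer_iters):
        for _ in range(newton_iters):
            slack = 1.0 - x
            grad = 2.0 * (x - 2.0) + mu / slack
            hess = 2.0 + mu / slack**2
            step = -grad / hess
            # Halve the step until the iterate stays strictly feasible (x < 1).
            while x + step >= 1.0:
                step *= 0.5
            x += step
            if abs(step) < 1e-12:
                break
        mu *= shrink
    return x

x_star = barrier_solve()
```

The unconstrained minimizer x = 2 is infeasible, so the iterates approach the constraint boundary x = 1 from the inside as mu shrinks. The derivatives written out by hand here are the quantities that automatic differentiation would supply in a larger problem.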
Significance. If the open questions on asymptotic exactness were resolved with positive convergence results, the framework could offer a systematic way to apply modern nonlinear programming tools to constrained stochastic control with nonstationary uncertainty. The explicit use of automatic differentiation and interior-point methods is a practical strength, and the formulation of open questions provides a clear research agenda. At present, however, the absence of any error bounds or consistency analysis limits the result to a conceptual proposal whose practical significance remains to be demonstrated.
major comments (2)
- [Abstract] Abstract: the central claim that the finite NLP obtained by function approximation, deterministic sampling, and temporal truncation can be used to solve the original constrained stochastic optimal control problem rests on the asymptotic exactness of this scheme, yet the manuscript itself states that this exactness is posed as an open question with no accompanying error bounds, consistency proof, or convergence analysis supplied anywhere in the text.
- [Numerical example] Numerical example section: validation is limited to a single numerical example with no comparison against alternative methods for nonstationary stochastic control or against the infinite-dimensional problem, making it impossible to assess whether the interior-point solution of the approximated NLP faithfully represents the original problem.
Simulated Author's Rebuttal
We thank the referee for the detailed review. The manuscript is explicitly framed as a conceptual proposal that formulates open questions on asymptotic exactness rather than claiming to resolve them. We respond point by point to the major comments below.
- Referee: [Abstract] Abstract: the central claim that the finite NLP obtained by function approximation, deterministic sampling, and temporal truncation can be used to solve the original constrained stochastic optimal control problem rests on the asymptotic exactness of this scheme, yet the manuscript itself states that this exactness is posed as an open question with no accompanying error bounds, consistency proof, or convergence analysis supplied anywhere in the text.
  Authors: The abstract does not advance a claim that the finite NLP solves the original problem. It describes the Markov-embeddability formulation, the approximation steps, the use of automatic differentiation and interior-point methods to solve the resulting NLP, and then states that open questions on asymptotic exactness are formulated. The contribution is therefore the casting of the problem and the practical solution procedure for the approximation, with the open questions serving as an explicit research agenda. No error bounds are supplied because their derivation is left open.
  Revision: no
- Referee: [Numerical example] Numerical example section: validation is limited to a single numerical example with no comparison against alternative methods for nonstationary stochastic control or against the infinite-dimensional problem, making it impossible to assess whether the interior-point solution of the approximated NLP faithfully represents the original problem.
  Authors: The numerical example is presented solely as a proof of concept, consistent with the abstract wording. A single illustrative instance is appropriate for demonstrating that the overall pipeline (formulation, approximation, and solver) can be executed. Systematic comparisons to other nonstationary stochastic control methods or to the infinite-dimensional problem would require additional theoretical and computational machinery that lies outside the scope of the current conceptual contribution.
  Revision: no
- Derivation of error bounds, consistency proofs, or convergence analysis for the approximation scheme
- Empirical comparisons against alternative methods or the infinite-dimensional formulation
Circularity Check
No significant circularity; derivation relies on explicit assumption and standard methods with open questions stated.
full rationale
The paper introduces Markov embeddability as a new assumption to recast the SOC problem, then applies function approximation, deterministic sampling, and temporal truncation to obtain a finite NLP solved via automatic differentiation and interior-point methods. It explicitly formulates open questions on asymptotic exactness rather than claiming convergence. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain. The central claim remains an approximation approach whose validity is left partially open, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Markov embeddability assumption