Policy Optimization with Differentiable MPC: Convergence Analysis under Uncertainty
Pith reviewed 2026-05-16 18:11 UTC · model grok-4.3
The pith
Combining gradient-based policy optimization with recursive system identification ensures convergence to an optimal controller design under uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient-based policy optimization combined with recursive system identification ensures convergence to an optimal controller design for differentiable MPC policies, even when the initial dynamical model is uncertain, as demonstrated across multiple control examples.
What carries the argument
Differentiable MPC policies that embed explicit dynamical models and update them through recursive system identification during gradient-based policy optimization.
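A minimal sketch of that loop, under strong simplifying assumptions that are not taken from the paper: a scalar plant `x+ = a*x + b*u + w` with unknown `a`, a linear feedback `u = -p*x` standing in for the differentiable MPC policy, a central finite difference standing in for backpropagation through the MPC layer, and a projection interval `[0.3, 2.0]` chosen only so that the toy plant stays stable. All names and numerical values are hypothetical.

```python
import numpy as np

def model_cost(p, a_hat, b=1.0, r=0.1, T=20, x0=1.0):
    """Noise-free closed-loop cost of u = -p*x predicted by the model x+ = a_hat*x + b*u."""
    x, cost = x0, 0.0
    for _ in range(T):
        u = -p * x
        cost += x**2 + r * u**2
        x = a_hat * x + b * u
    return cost

def cost_gradient(p, a_hat, h=1e-5):
    """Central finite difference (assumption: stands in for backprop through an MPC layer)."""
    return (model_cost(p + h, a_hat) - model_cost(p - h, a_hat)) / (2 * h)

def rls_update(a_hat, P, phi, y):
    """One scalar recursive least-squares step for the model y = a*phi + noise."""
    gain = P * phi / (1.0 + phi * P * phi)
    a_hat = a_hat + gain * (y - phi * a_hat)
    P = P - gain * phi * P
    return a_hat, P

def run(n_iters=200, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a_true, b = 1.2, 1.0      # true plant; a_true is unknown to the designer
    a_hat, P = 0.5, 100.0     # deliberately poor initial model, large covariance
    p = 1.0                   # initial policy parameter
    for _ in range(n_iters):
        # 1) Run the current policy on the true plant with a small exploration signal.
        x = 1.0
        for _ in range(20):
            u = -p * x + 0.1 * rng.standard_normal()
            x_next = a_true * x + b * u + 0.05 * rng.standard_normal()
            # 2) Recursive identification from the observed transition.
            a_hat, P = rls_update(a_hat, P, x, x_next - b * u)
            x = x_next
        # 3) Projected gradient step on the policy, using the current model estimate.
        p = float(np.clip(p - step * cost_gradient(p, a_hat), 0.3, 2.0))
    return p, a_hat

p_final, a_final = run()
# Reference value: best gain on a grid when the true model is used in the cost.
grid = np.linspace(0.3, 2.0, 1701)
p_best = grid[np.argmin([model_cost(q, 1.2) for q in grid])]
print(f"identified a = {a_final:.3f}, tuned p = {p_final:.3f}, best p under true model = {p_best:.3f}")
```

In this toy setting the identified parameter and the tuned gain are expected to end up close to their true-model counterparts. The sketch illustrates the interplay between identification and policy updates only; it says nothing about the conditions under which the paper's guarantees hold.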
If this is right
- The resulting MPC controllers achieve optimal performance despite starting with inaccurate models.
- Convergence of the policy parameters occurs reliably as identification improves the embedded dynamics.
- The framework applies across a wide range of control applications where models contain uncertainty.
- Gradient-based updates on the policy remain stable when identification runs concurrently.
Where Pith is reading between the lines
- The method could support adaptive control when system dynamics slowly change over time.
- Hardware tests with sensor noise would reveal the practical limits of real-time identification accuracy.
- Viewing the identification step as part of the policy gradient might link the approach to certain reinforcement learning algorithms.
Load-bearing premise
Recursive system identification can be performed accurately and in real time under the uncertainty levels of the target applications so that the joint optimization reaches the true optimum.
What would settle it
An experiment or simulation in which recursive identification error stays persistently high due to noise or unmodeled dynamics, causing the policy parameters to diverge or converge to a clearly suboptimal controller, would disprove the convergence claim.
Original abstract
Model-based policy optimization is a well-established framework for designing reliable and high-performance controllers across a wide range of control applications. Recently, this approach has been extended to model predictive control policies, where explicit dynamical models are embedded within the control law. However, the performance of the resulting controllers, and the convergence of the associated optimization algorithms, critically depends on the accuracy of the models. In this paper, we demonstrate that combining gradient-based policy optimization with recursive system identification ensures convergence to an optimal controller design and showcase our finding in several control examples.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that combining gradient-based policy optimization with recursive system identification in differentiable MPC policies ensures convergence to an optimal controller design under uncertainty. This is asserted via theoretical analysis (with main theorems on convergence) and demonstrated in several control examples.
Significance. If the central convergence result can be established with explicit error bounds that propagate through the differentiable MPC layer, the work would advance model-based control by providing guarantees for adaptive policies in uncertain environments, building on established frameworks with a joint optimization approach.
major comments (2)
- [§3, Theorem 1] §3 (Convergence Analysis), Theorem 1: the derivation treats the recursively identified model as sufficiently accurate for policy gradients to remain valid and reach the true optimum, but supplies no explicit bound on residual identification error or its propagation through the MPC layer; this assumption is load-bearing for the claim under persistent noise or poor excitation.
- [§4.2] §4.2 (Recursive Identification): the scheme is analyzed only in expectation without conditions on persistent excitation or uncertainty levels that would ensure the identified model tracks the true plant closely enough to avoid convergence to a model-dependent local optimum rather than the claimed global one.
minor comments (2)
- [Abstract] The abstract asserts convergence without a proof sketch or list of assumptions; adding a one-sentence statement of the key technical conditions would improve clarity.
- [§2] Notation for the policy gradient through the differentiable MPC (e.g., the chain-rule term) is introduced without an explicit equation reference in the main text, making the exposition harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and assumptions of our convergence analysis. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: [§3, Theorem 1] §3 (Convergence Analysis), Theorem 1: the derivation treats the recursively identified model as sufficiently accurate for policy gradients to remain valid and reach the true optimum, but supplies no explicit bound on residual identification error or its propagation through the MPC layer; this assumption is load-bearing for the claim under persistent noise or poor excitation.
  Authors: Theorem 1 establishes asymptotic convergence of the policy parameters to the optimum under the assumption that the recursive identification error converges to zero in expectation. The proof proceeds by showing that the composite gradient (policy optimization through the differentiable MPC layer) becomes unbiased as identification error vanishes. We agree that the current manuscript does not supply explicit finite-time bounds on residual error propagation; deriving such bounds for general nonlinear systems with differentiable MPC is technically involved and beyond the scope of the present theoretical development. We will add a remark in the revised §3 discussing the role of identification error propagation and citing relevant sensitivity results for MPC layers.
  Revision: partial
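The claim that the gradient bias vanishes with the identification error can be illustrated numerically. The sketch below is a toy under assumptions that are not from the paper: a scalar model `x+ = a_hat*x + b*u`, a linear feedback `u = -p*x` in place of the MPC layer, and a finite-difference gradient. It compares the model-based gradient with the gradient under the true parameter for a shrinking, hypothetical sequence of model errors.

```python
def model_cost(p, a_hat, b=1.0, r=0.1, T=20, x0=1.0):
    """Noise-free closed-loop cost of u = -p*x predicted by the model x+ = a_hat*x + b*u."""
    x, cost = x0, 0.0
    for _ in range(T):
        u = -p * x
        cost += x**2 + r * u**2
        x = a_hat * x + b * u
    return cost

def cost_gradient(p, a_hat, h=1e-5):
    """Central finite difference of the cost with respect to the policy parameter p."""
    return (model_cost(p + h, a_hat) - model_cost(p - h, a_hat)) / (2 * h)

a_true, p = 1.2, 1.0                          # hypothetical true parameter and policy gain
model_errors = [0.4, 0.2, 0.1, 0.05, 0.0]     # hypothetical identification errors a_hat - a_true
exact = cost_gradient(p, a_true)
biases = [abs(cost_gradient(p, a_true + e) - exact) for e in model_errors]

for e, bias in zip(model_errors, biases):
    print(f"model error {e:.2f} -> gradient bias {bias:.4f}")
```

The printed bias shrinks as the model error shrinks, which is the qualitative behaviour the rebuttal relies on. A finite-time guarantee would additionally need a bound on how fast the error itself decays, which is the gap the referee points out.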
- Referee: [§4.2] §4.2 (Recursive Identification): the scheme is analyzed only in expectation without conditions on persistent excitation or uncertainty levels that would ensure the identified model tracks the true plant closely enough to avoid convergence to a model-dependent local optimum rather than the claimed global one.
  Authors: The analysis in §4.2 follows the standard stochastic approximation framework for recursive least-squares identification and establishes convergence in expectation. Persistent excitation is a standard prerequisite for parameter convergence in this literature; we implicitly rely on it in both the theoretical development and the numerical examples. To address the concern about possible convergence to model-dependent local optima, we will insert an explicit statement of the persistent-excitation assumption together with a brief discussion of how it guarantees that the identified model tracks the true plant sufficiently closely for the overall policy to reach the claimed optimum.
  Revision: yes
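Why persistent excitation matters for recursive least squares can be seen in a small experiment. The sketch below is illustrative only and rests on assumptions not found in the paper: a scalar plant `x+ = a*x + b*u + w` with both `a` and `b` unknown, a fixed feedback gain `k`, and hypothetical noise levels. Under pure feedback `u = -k*x` the regressors `[x, u]` are collinear, so only the closed-loop combination `a - b*k` is identifiable; adding an exploration signal makes both parameters identifiable.

```python
import numpy as np

def rls(regressors, targets, p0=100.0):
    """Recursive least squares for targets[t] ~= regressors[t] @ theta, starting from theta = 0."""
    n = regressors.shape[1]
    theta, P = np.zeros(n), p0 * np.eye(n)
    for phi, y in zip(regressors, targets):
        gain = P @ phi / (1.0 + phi @ P @ phi)
        theta = theta + gain * (y - phi @ theta)
        P = P - np.outer(gain, phi @ P)
    return theta

def closed_loop_data(explore_std, n_steps=2000, seed=0):
    """Simulate x+ = a*x + b*u + w under u = -k*x plus optional exploration noise."""
    rng = np.random.default_rng(seed)
    a, b, k = 0.9, 0.5, 0.4   # hypothetical plant parameters and feedback gain
    x, rows, ys = 0.0, [], []
    for _ in range(n_steps):
        u = -k * x + explore_std * rng.standard_normal()
        x_next = a * x + b * u + 0.1 * rng.standard_normal()
        rows.append([x, u])
        ys.append(x_next)
        x = x_next
    return np.array(rows), np.array(ys), np.array([a, b])

X_ok, y_ok, theta_true = closed_loop_data(explore_std=0.5)
X_bad, y_bad, _ = closed_loop_data(explore_std=0.0)
err_excited = np.linalg.norm(rls(X_ok, y_ok) - theta_true)
err_unexcited = np.linalg.norm(rls(X_bad, y_bad) - theta_true)
print(f"parameter error with exploration:    {err_excited:.3f}")
print(f"parameter error without exploration: {err_unexcited:.3f}")
```

Without exploration the estimate stays far from the true parameters even though the one-step prediction along the closed-loop trajectory is accurate. A policy gradient computed from such a model could then point toward a model-dependent optimum, which is the failure mode raised in the referee comment.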
Circularity Check
No circularity detected; abstract states claim without derivation or self-referential reduction
Full rationale
The provided abstract asserts that gradient-based policy optimization combined with recursive system identification ensures convergence to an optimal controller, but contains no equations, theorems, or derivation steps that could be inspected for self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. No specific reduction (e.g., Eq. X equivalent to input by construction) is present or quotable. Per hard rules, absence of inspectable load-bearing steps that collapse to inputs yields score 0; the reader's indeterminate 5.0 reflects the same lack of visible chain rather than any detected circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  “We demonstrate that combining gradient-based policy optimization with recursive system identification ensures convergence to an optimal controller design”
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.equivNat · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  “the solution map y(p) of (30) is unique, locally Lipschitz and definable”
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Sznaier and M. J. Damborg, “Suboptimal control of linear systems with state and control inequality constraints,” in 26th IEEE Conference on Decision and Control, vol. 26. IEEE, 1987, pp. 761–762.
- [2] H. Chen and F. Allgöwer, “A quasi-infinite horizon nonlinear model predictive control scheme with guaranteed stability,” Automatica, vol. 34, no. 10, pp. 1205–1217, 1998.
  [Figure placeholder: Fig. 9. State and input trajectories of different controllers... Panels: $e_{cg}$, $\dot{e}_{cg}$, $\theta_e$, $\dot{\theta}_e$, $u$; horizontal axis: Time Step; legend: Untrained, Trained, Best.]
- [3] P. O. Scokaert and J. B. Rawlings, “Constrained linear quadratic regulation,” IEEE Transactions on Automatic Control, vol. 43, no. 8, pp. 1163–1169, 1998.
- [4] J. Köhler, M. A. Müller, and F. Allgöwer, “Analysis and design of model predictive control frameworks for dynamic operation—An overview,” Annual Reviews in Control, vol. 57, p. 100929, 2024.
- [5] W. Edwards, G. Tang, G. Mamakoukas, T. Murphey, and K. Hauser, “Automatic tuning for data-driven model predictive control,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 7379–7385.
- [6] F. Sorourifar, G. Makrygirgos, A. Mesbah, and J. A. Paulson, “A data-driven automatic tuning method for MPC under uncertainty using constrained Bayesian optimization,” IFAC-PapersOnLine, vol. 54, no. 3, pp. 243–250, 2021.
- [7] A. G. Puigjaner, M. Prajapat, A. Carron, A. Krause, and M. N. Zeilinger, “Performance-driven Constrained Optimal Auto-Tuner for MPC,” IEEE Robotics and Automation Letters, 2025.
- [8] B. Amos and J. Z. Kolter, “Optnet: Differentiable optimization as a layer in neural networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 136–145.
- [9] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable MPC for end-to-end planning and control,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- [10] R. Zuliani, E. C. Balta, and J. Lygeros, “BP-MPC: Optimizing the closed-loop performance of MPC using BackPropagation,” IEEE Transactions on Automatic Control, 2025.
- [11] R. Zuliani, E. C. Balta, and J. Lygeros, “Closed-loop performance optimization of model predictive control with robustness guarantees,” European Journal of Control, p. 101319, 2025.
- [12] A. Agrawal, S. Barratt, S. Boyd, and B. Stellato, “Learning convex optimization control policies,” in Learning for Dynamics and Control. PMLR, 2020, pp. 361–373.
- [13] A. Oshin, H. Almubarak, and E. A. Theodorou, “Differentiable robust model predictive control,” arXiv preprint arXiv:2308.08426, 2023.
- [14] J. Drgoňa, K. Kiš, A. Tuor, D. Vrabie, and M. Klaučo, “Differentiable predictive control: Deep learning alternative to explicit model predictive control for unknown nonlinear systems,” Journal of Process Control, vol. 116, pp. 80–92, 2022.
- [15] R. Tao, S. Cheng, X. Wang, S. Wang, and N. Hovakimyan, “DiffTune-MPC: Closed-loop learning for model predictive control,” IEEE Robotics and Automation Letters, 2024.
- [16] J. Frey, K. Baumgärtner, G. Frison, D. Reinhardt, J. Hoffmann, L. Fichtner, S. Gros, and M. Diehl, “Differentiable Nonlinear Model Predictive Control,” arXiv preprint arXiv:2505.01353, 2025.
- [17] S. Gros and M. Zanon, “Data-driven economic NMPC using reinforcement learning,” IEEE Transactions on Automatic Control, vol. 65, no. 2, pp. 636–648, 2019.
- [18] S. Gros, M. Zanon, and A. Bemporad, “Safe reinforcement learning via projection on a safe set: How to achieve optimality?” IFAC-PapersOnLine, vol. 53, no. 2, pp. 8076–8081, 2020.
- [19] S. Gros and M. Zanon, “Learning for MPC with stability & safety guarantees,” Automatica, vol. 146, p. 110598, 2022.
- [20] M. Zanon, S. Gros, and A. Bemporad, “Practical reinforcement learning of stabilizing economic MPC,” in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 2258–2263.
- [21] A. V. Fiacco and G. P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques. Society for Industrial and Applied Mathematics, Jan. 1968.
- [22] K. Jittorntrum, Sequential algorithms in nonlinear programming. The Australian National University (Australia), 1978.
- [23] H. Pirnay, R. López-Negrete, and L. T. Biegler, “Optimal sensitivity based on IPOPT,” Mathematical Programming Computation, vol. 4, pp. 307–331, 2012.
- [24] J. A. Andersson and J. B. Rawlings, “Sensitivity analysis for nonlinear programming in CasADi,” IFAC-PapersOnLine, vol. 51, no. 20, pp. 331–336, 2018.
- [25] J. Bolte and E. Pauwels, “Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning,” Mathematical Programming, vol. 188, pp. 19–51, 2021.
- [26] E. C. Kerrigan and J. M. Maciejowski, “Soft constraints and exact penalty functions in model predictive control,” in Control 2000 Conference, Cambridge, 2000, pp. 2319–2327.
- [27] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” Advances in Neural Information Processing Systems, vol. 24, 2011.
- [28] R. Johnstone, C. Johnson, R. Bitmead, and B. O. Anderson, “Exponential convergence of recursive least squares with exponential forgetting factor,” in 1982 21st IEEE Conference on Decision and Control. IEEE, Dec. 1982, pp. 994–997. [Online]. Available: http://dx.doi.org/10.1109/CDC.1982.268295
- [29] M. Coste, Introduction to o-minimal geometry. Rennes, France: Institut de recherche mathématique de Rennes (IRMAR), 1999.
- [30] F. H. Clarke, Optimization and nonsmooth analysis. SIAM, 1990.
- [31] D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee, “Stochastic subgradient method converges on tame functions,” Foundations of Computational Mathematics, vol. 20, no. 1, pp. 119–154, 2020.
- [32] R. T. Rockafellar and R. J.-B. Wets, Variational analysis. Springer Science & Business Media, 2009, vol. 317.
- [33] Y. Bar-Shalom and E. Tse, “Dual effect, certainty equivalence, and separation in stochastic control,” IEEE Transactions on Automatic Control, vol. 19, no. 5, pp. 494–500, 2003.
- [34] M. C. Campi and S. Garatti, “A sampling-and-discarding approach to chance-constrained optimization: feasibility and optimality,” Journal of Optimization Theory and Applications, vol. 148, no. 2, pp. 257–280, 2011.
- [35] A. Abdulkareem, V. Oguntosin, O. M. Popoola, and A. A. Idowu, “Modeling and nonlinear control of a quadcopter for stabilization and trajectory tracking,” Journal of Engineering, vol. 2022, no. 1, p. 2449901, 2022.
- [36] J. M. Snider et al., “Automatic steering methods for autonomous automobile path tracking,” Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RITR-09-08, 2009.
- [37] P. Speissegger, “The Pfaffian closure of an o-minimal structure,” J. Reine Angew. Math. 508, pp. 198–211, 1999.
- [38] J. Bolte, T. Le, and E. Pauwels, “Subgradient sampling for nonsmooth nonconvex minimization,” SIAM Journal on Optimization, vol. 33, no. 4, pp. 2542–2569, 2023.
- [39] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
- [40] A. Bagirov, N. Karmitsa, and M. M. Mäkelä, Introduction to Nonsmooth Optimization: theory, practice and software. Springer, 2014, vol. 12.
- [41] L. van den Dries and C. Miller, “Geometric categories and o-minimal structures,” Duke Mathematical Journal, vol. 84, no. 2, Aug. 1996.
APPENDIX
A. Proof of Theorem 2
The core of the proof involves showing that Assumption A in [31] is verified, and then leveraging Theorem 1 in [31]. For completeness, we report the assumption here for an algorithm of the for...
- All limit points of $\{p_k\}$ lie in $\mathcal{P}$.
- The iterates are bounded, that is, $\sup_{k \geq 1} \|p_k\| < \infty$ and $\sup_{k \geq 1} \|d_k\| < \infty$.
- $\sum_{k \in \mathbb{N}} \alpha_k = \infty$ and $\sum_{k \in \mathbb{N}} \alpha_k^2 < \infty$.
- For any unbounded increasing sequence $\{k_j\} \subset \mathbb{N}$ such that $p_{k_j} \to \bar{p}$, it holds that
  $$\lim_{n \to \infty} \operatorname{dist}\left( \frac{1}{n} \sum_{j=1}^{n} d_{k_j},\; G(\bar{p}) \right) = 0.$$
We start by proving that $\mathcal{J}^k_{\bar{\mathcal{C}}}$ represents a “sample” of the true Jacobian $\mathcal{J}_{\mathcal{C}}$ of $\mathcal{C}$.
Lemma 4. Under Assumption 3, the expected cost $\mathcal{C}(p) := \mathbb{E}_v[\bar{\mathcal{C}}(p, v)]$ is locally Lipschitz and definable with conservative Jacobian $\mathcal{J}_{\mathcal{C}}(p) = \mathbb{E}_v[\mathcal{J}_{\bar{\mathcal{C}}}(p$...
Since $\nabla^2 f_k(p) = 2I$, for any $p, p'$,
$$f_k(p') \geq f_k(p) + \nabla f_k(p)^\top (p' - p) + \|p' - p\|^2.$$
If $p \in \arg\min_{x \in \Phi} f_k(x)$, then $\nabla f_k(p)^\top d \geq 0$ for all locally feasible directions $d \neq 0$, i.e., those $d \neq 0$ such that there exists some $\epsilon > 0$ for which $x + td \in \Phi$ for all $t \in (0, \epsilon]$ [40, Theorem 4.9]. Therefore, if $p'$ is sufficiently close to $p$, then $\nabla f_k(p)^\top (p - p') \geq 0$ and therefore $f$ satisfies ...
We have
$$\|\nabla f_k(p) - \nabla g_k(p)\| = 2 \left\| p - p_k + \alpha_k J^k_{\bar{\mathcal{C}}}(\theta) - p + p_k - \alpha_k J^k_{\bar{\mathcal{C}}} \right\| \leq 2 \alpha_k \left\| J^k_{\bar{\mathcal{C}}}(\theta) - J^k_{\bar{\mathcal{C}}} \right\| \leq 2 \alpha_k L_1 \operatorname{diam} \Theta_k =: \kappa,$$
where the last step follows from Lemma 5. This means that $f_k - g_k$ is $\kappa$-Lipschitz. Let $S_0$ be the set of minimizers of the problem $\min_{p \in \Phi} f_k(p)$, and let $S_1$ be the set of minimizers of the problem $\min_{p \in \Phi} g_k(p)$. Observe that $\alpha_k \|$...
Path-differentiability of quadratic programs: For completeness, we provide sufficient conditions for the path-differentiability of a quadratic program of the form
$$\begin{aligned} \underset{x}{\text{minimize}} \quad & \tfrac{1}{2} x^\top Q(p) x + q(p)^\top x \\ \text{subject to} \quad & F(p) x = f(p), \quad G(p) x \leq g(p), \end{aligned} \tag{30}$$
where $p$ is a parameter. We require the following constraint qualification.
Definition 1. Let $x$ be an optimizer of (30). We say ...
Path-differentiability of nonlinear optimization problems: In this section we provide sufficient conditions for the path-differentiability of nonlinear optimization problems, extending the results of Appendix B.1 and allowing the utilization of nonlinear MPC formulations in (9). We consider a parameterized nonlinear programming problem (NLP) in standar...