Rollout Then Optimize: A One-Step Newton Refinement of Learned Policies for Nonlinear Model Predictive Control
Pith reviewed 2026-05-22 21:32 UTC · model grok-4.3
The pith
One Newton step on a learned-policy rollout reduces its suboptimality relative to the MPC solution quadratically in the policy approximation error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A learned policy supplies a nominal trajectory that is refined online by one Newton step implemented via Riccati recursion within the MPC scheme. This produces bounds on the policy approximation error relative to the MPC policy and reduces the suboptimality of the learned rollout quadratically in that error, at minimal added computational cost.
What carries the argument
A single Newton refinement step applied to the learned-policy rollout, computed via Riccati recursion inside the MPC optimization.
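A minimal sketch of the rollout-then-refine idea, under strong simplifying assumptions: linear dynamics, a quadratic cost, and no constraints. The paper's setting is nonlinear and constrained, and this is not the authors' implementation. The names `rollout`, `total_cost`, `riccati_newton_step`, and the gain `K_learned` are hypothetical and chosen only for illustration. For nonlinear dynamics, `A` and `B` would instead come from linearizing about the rollout.

```python
import numpy as np

def rollout(policy, A, B, x0, N):
    """Simulate the policy for N steps; returns states (N+1, nx) and inputs (N, nu)."""
    xs, us = [x0], []
    for _ in range(N):
        us.append(policy(xs[-1]))
        xs.append(A @ xs[-1] + B @ us[-1])
    return np.array(xs), np.array(us)

def total_cost(xs, us, Q, R, QN):
    """Quadratic regulation cost along a trajectory."""
    stage = sum(0.5 * x @ Q @ x + 0.5 * u @ R @ u for x, u in zip(xs[:-1], us))
    return stage + 0.5 * xs[-1] @ QN @ xs[-1]

def riccati_newton_step(xs, us, A, B, Q, R, QN):
    """One Newton step about a dynamically feasible nominal trajectory (xs, us)."""
    N = len(us)
    P, p = QN, QN @ xs[-1]              # terminal value-function Hessian and gradient
    K, d = [None] * N, [None] * N
    for k in reversed(range(N)):        # backward Riccati sweep
        Quu = R + B.T @ P @ B
        Qux = B.T @ P @ A
        qu = R @ us[k] + B.T @ p
        qx = Q @ xs[k] + A.T @ p
        K[k] = -np.linalg.solve(Quu, Qux)   # feedback gain on the state deviation
        d[k] = -np.linalg.solve(Quu, qu)    # feedforward correction
        p = qx + Qux.T @ d[k]
        P = Q + A.T @ P @ A + Qux.T @ K[k]
    dx = np.zeros_like(xs[0])           # the initial state is fixed
    new_xs, new_us = [xs[0]], []
    for k in range(N):                  # forward sweep applies the correction
        du = K[k] @ dx + d[k]
        new_us.append(us[k] + du)
        dx = A @ dx + B @ du
        new_xs.append(xs[k + 1] + dx)
    return np.array(new_xs), np.array(new_us)

# Toy double integrator (assumed for illustration, not the quadcopter model).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q, R, QN = np.eye(2), 0.1 * np.eye(1), np.eye(2)
K_learned = np.array([[0.5, 0.5]])      # hypothetical stand-in for a learned policy

xs0, us0 = rollout(lambda x: -K_learned @ x, A, B, np.array([1.0, 0.0]), N=20)
xs1, us1 = riccati_newton_step(xs0, us0, A, B, Q, R, QN)
xs2, us2 = riccati_newton_step(xs1, us1, A, B, Q, R, QN)

J_rollout = total_cost(xs0, us0, Q, R, QN)
J_refined = total_cost(xs1, us1, Q, R, QN)
print("rollout cost:", J_rollout, "refined cost:", J_refined)
```

Because this toy problem is linear-quadratic, a single step already lands on the optimum, so the second call leaves the inputs unchanged. In the nonlinear case the step would only be a local improvement, which is where the basin-of-attraction question raised in the referee report enters.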
If this is right
- The refined policy achieves performance close to a fully converged MPC solution on the tested task.
- The method requires roughly half the computational time of full MPC while staying close in cost.
- The quadratic error reduction holds when the learned rollout's approximation error is sufficiently small.
- The controller can be deployed online at runtime using only the learned policy plus one Riccati solve.
Where Pith is reading between the lines
- Systems with reliable models could use coarse learned policies and still reach near-optimal closed-loop behavior after the single correction.
- The same one-step structure might be applied to other iterative solvers where the first update already captures most of the gain.
- Pre-factoring the Riccati equations offline could further reduce online latency on embedded hardware.
Load-bearing premise
The system model is known and accurate enough that a single Riccati-based Newton step on the learned rollout realizes the quadratic reduction in suboptimality.
What would settle it
In the quadcopter trajectory-tracking simulation, measure suboptimality after the Newton step across a range of learned-policy errors; if the reduction is not quadratic or the refined cost does not approach the fully converged MPC cost, the central claim fails.
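One way such a measurement could be summarized is by fitting the slope of suboptimality against policy error on log-log axes: a slope near 2 is consistent with quadratic dependence, a slope near 1 with linear dependence. The sketch below uses synthetic placeholder data, not results from the paper, and the helper name `loglog_slope` is hypothetical. Which error norm and which suboptimality measure belong on the axes is defined by the paper, not by this snippet.

```python
import numpy as np

def loglog_slope(errors, suboptimality):
    """Least-squares slope of log(suboptimality) against log(error)."""
    slope, _intercept = np.polyfit(np.log(errors), np.log(suboptimality), 1)
    return slope

# Synthetic placeholder data (NOT measurements from the paper).
eps = np.logspace(-3, -1, 10)        # assumed range of policy errors
linear_like = 2.0 * eps              # suboptimality proportional to the error
quadratic_like = 5.0 * eps**2        # suboptimality proportional to the squared error

slope_linear = loglog_slope(eps, linear_like)
slope_quadratic = loglog_slope(eps, quadratic_like)
print(slope_linear, slope_quadratic)
```

With real data the fitted slope will be noisy, so it would be sensible to restrict the fit to the range of small errors where a local result is expected to apply.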
Original abstract
We propose a computationally efficient rollout-then-optimize method to improve a learned control policy at deployment time. A learned policy provides a nominal trajectory, which is refined online by a single Newton step implemented via a Riccati recursion within a model predictive control (MPC) scheme. This refinement combines model knowledge with the learned policy at minimal additional computational cost. We establish bounds on the approximation error of the learned policy relative to the MPC policy and show that one Newton step reduces the suboptimality of the learned rollout quadratically in the policy approximation error. The proposed controller is validated in simulation on a constrained trajectory-tracking task for a quadcopter with nonlinear dynamics. Results highlight that the Newton step significantly improves the learned policy, achieving performance close to a fully converged MPC solution while requiring roughly half of the computational time. The code is available at https://github.com/aghezz1/rl-riccati.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a rollout-then-optimize scheme for nonlinear MPC in which a learned policy supplies a nominal trajectory that is refined online by a single Newton step realized via Riccati recursion. The authors derive bounds on the approximation error between the learned policy and the MPC solution and claim that this one-step refinement reduces suboptimality quadratically in the policy error. The approach is tested in simulation on a constrained quadcopter trajectory-tracking problem with nonlinear dynamics, where the refined controller approaches the performance of a fully converged MPC solver while using roughly half the computation time. Reproducible code is provided.
Significance. If the local quadratic-reduction claim is placed on a firm footing, the work supplies a practical, low-overhead mechanism for injecting model knowledge into learned policies at deployment time. The combination of an explicit error bound with a Riccati-based Newton step and the open-source implementation are concrete strengths that would be of interest to the MPC and learning-based control communities.
major comments (2)
- [§3] §3 (theoretical results): The manuscript establishes an a-priori bound on the policy approximation error and asserts quadratic reduction of suboptimality after one Newton step. However, quadratic convergence of Newton’s method on the constrained nonlinear program requires the initial rollout to lie inside the basin of attraction whose radius depends on the Lipschitz constant of the Hessian and the constraint qualification at the optimum. The paper does not show that the derived error bound is smaller than this (problem-dependent) radius, nor does it verify the condition numerically for the quadcopter dynamics and state/input constraints.
- [§4] §4 (quadcopter example): The simulation results report that the Newton-refined policy achieves performance “close to” fully converged MPC. Because the quadratic-reduction guarantee is local, it is necessary to report the actual policy error norm and to confirm that this error lies inside the estimated basin of attraction for the chosen horizon and constraint set; without such a check the observed improvement could be linear rather than quadratic.
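For orientation, the textbook local estimate for Newton's method on a smooth unconstrained problem (see, e.g., Nocedal and Wright, Numerical Optimization) can be written as below. The symbols z, z*, L, and M are generic placeholders rather than the paper's notation, and the constrained MPC setting needs further regularity assumptions, so this is only a sketch of what a basin-radius check would compare against.

```latex
% z: current iterate (here, the learned rollout), z^\star: local minimizer,
% L: Lipschitz constant of the Hessian near z^\star,
% M = \| \nabla^2 f(z^\star)^{-1} \|.
z^{+} = z - \nabla^2 f(z)^{-1} \nabla f(z),
\qquad
\| z^{+} - z^\star \| \;\le\; L \, M \, \| z - z^\star \|^{2}
\quad \text{for } z \text{ sufficiently close to } z^\star .
```

The bound implies a contraction only when L M ||z - z*|| < 1, so the relevant radius scales like 1/(L M). The policy-error bound would have to fall below a radius of this kind for the quadratic rate to apply.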
minor comments (2)
- Notation: the symbol used for the learned policy should be introduced once and used consistently; occasional reuse of the same letter for the MPC policy creates ambiguity in the error-bound statements.
- Figure 3: the caption should state the exact number of Newton iterations performed by the baseline MPC solver so that the reported timing comparison is unambiguous.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We agree that the local character of quadratic Newton convergence merits explicit verification and will strengthen the manuscript accordingly. Below we address each major comment.
- Referee: [§3] §3 (theoretical results): The manuscript establishes an a-priori bound on the policy approximation error and asserts quadratic reduction of suboptimality after one Newton step. However, quadratic convergence of Newton’s method on the constrained nonlinear program requires the initial rollout to lie inside the basin of attraction whose radius depends on the Lipschitz constant of the Hessian and the constraint qualification at the optimum. The paper does not show that the derived error bound is smaller than this (problem-dependent) radius, nor does it verify the condition numerically for the quadcopter dynamics and state/input constraints.
Authors: We acknowledge that the quadratic-reduction claim is local and that the manuscript does not explicitly compare the derived policy-error bound to the (problem-dependent) radius of the Newton basin. In the revision we will add a dedicated paragraph in §3 that recalls the standard basin-radius estimate from constrained Newton theory and then, for the quadcopter example, numerically evaluate both the observed policy-error norm and a conservative estimate of the basin radius (via the Lipschitz constant of the Hessian along the trajectory and the constraint qualification). This will either confirm that the error lies inside the basin or qualify the claim as “quadratic when the bound is smaller than the radius.” revision: yes
- Referee: [§4] §4 (quadcopter example): The simulation results report that the Newton-refined policy achieves performance “close to” fully converged MPC. Because the quadratic-reduction guarantee is local, it is necessary to report the actual policy error norm and to confirm that this error lies inside the estimated basin of attraction for the chosen horizon and constraint set; without such a check the observed improvement could be linear rather than quadratic.
Authors: We agree that reporting only qualitative closeness is insufficient once locality is emphasized. In the revised §4 we will (i) tabulate the policy-error norm (in the appropriate norm) for each tested initial condition, (ii) provide the numerical basin-radius estimate obtained from the same data, and (iii) add a short convergence-rate diagnostic (log-log plot of suboptimality versus policy error) that visually distinguishes linear from quadratic reduction. These additions will be accompanied by the corresponding code in the repository. revision: yes
Circularity Check
No significant circularity; derivation relies on external learned policy and standard Newton analysis
The paper treats the learned policy as an independent external input that supplies a nominal trajectory. The one-step Newton refinement is implemented via Riccati recursion inside an MPC scheme whose convergence properties are analyzed using standard local quadratic convergence results for Newton's method under the assumption that the initial error lies inside the basin of attraction. No equation reduces the claimed error bound or quadratic suboptimality reduction to a fitted parameter or to a self-citation chain; the bounds are derived from the problem data and the learned policy's approximation error, which is not defined by the refinement itself. The quadcopter validation uses simulation data independent of the theoretical derivation. This satisfies the criteria for a self-contained, non-circular result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: system dynamics permit implementation of a Newton step via Riccati recursion inside MPC.