arxiv: 2504.02710 · v2 · submitted 2025-04-03 · 🧮 math.OC

Rollout Then Optimize: A One-Step Newton Refinement of Learned Policies for Nonlinear Model Predictive Control

Pith reviewed 2026-05-22 21:32 UTC · model grok-4.3

classification 🧮 math.OC
keywords nonlinear model predictive control · learned policies · Newton refinement · Riccati recursion · quadcopter control · trajectory tracking · suboptimality bounds · approximation error

The pith

One Newton step on a learned policy's rollout reduces its suboptimality relative to the MPC solution quadratically in the policy's approximation error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a rollout-then-optimize controller that takes a nominal trajectory from a learned policy and refines it online with a single Newton step inside an MPC scheme. The Newton step is computed via Riccati recursion using the known system model, adding only modest computation while incorporating model knowledge. Bounds are derived on the learned policy's approximation error to the true MPC policy, showing that the single refinement reduces suboptimality quadratically in that error. The approach is tested on constrained quadcopter trajectory tracking, where the refined controller reaches performance close to fully converged MPC while using roughly half the time. A reader would care because the method offers a concrete way to deploy fast learned policies on systems whose dynamics are known well enough for one model-based correction step.

Core claim

A learned policy supplies a nominal trajectory that is refined online by one Newton step implemented via Riccati recursion within the MPC scheme. The paper establishes bounds on the learned policy's approximation error relative to the MPC policy and shows that the single step reduces the suboptimality of the learned rollout quadratically in that error, at minimal added computational cost.

What carries the argument

The one Newton refinement step applied to the learned policy rollout, performed via Riccati recursion inside the MPC optimization.
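
To make the mechanism concrete, here is a minimal sketch of a Riccati-based Newton step applied to a policy rollout. It is not the authors' implementation, and it simplifies heavily: it assumes a scalar linear system x_{k+1} = a*x_k + b*u_k with quadratic cost and no constraints, a special case in which the single step lands exactly on the optimum. The function names, the gain -0.8 standing in for a learned policy, and all numerical values are hypothetical. In the paper's nonlinear setting, a and b would come from linearizing the model along the rollout and the step would only be locally accurate.

```python
def rollout(x0, policy, a, b, horizon):
    """Simulate x_{k+1} = a*x_k + b*u_k under the feedback law u = policy(x)."""
    xs, us = [x0], []
    for _ in range(horizon):
        us.append(policy(xs[-1]))
        xs.append(a * xs[-1] + b * us[-1])
    return xs, us


def cost(xs, us, q, r, q_term):
    """Finite-horizon quadratic cost of a trajectory."""
    stage = sum(0.5 * (q * x * x + r * u * u) for x, u in zip(xs, us))
    return stage + 0.5 * q_term * xs[-1] ** 2


def riccati_newton_step(xs, us, a, b, q, r, q_term):
    """One Newton step around the nominal trajectory (xs, us)."""
    horizon = len(us)
    # Hessian P and gradient p of the cost-to-go at the nominal terminal state.
    P, p = q_term, q_term * xs[-1]
    gains = [None] * horizon
    # Backward sweep: propagate the quadratic cost-to-go model.
    for k in reversed(range(horizon)):
        Qx = q * xs[k] + a * p
        Qu = r * us[k] + b * p
        Qxx = q + a * P * a
        Quu = r + b * P * b
        Qux = b * P * a
        gains[k] = (-Qux / Quu, -Qu / Quu)  # feedback gain, feedforward term
        P = Qxx - Qux * Qux / Quu
        p = Qx - Qux * Qu / Quu
    # Forward sweep: apply the correction starting from zero state deviation.
    dx = 0.0
    new_xs, new_us = [xs[0]], []
    for k in range(horizon):
        K, kff = gains[k]
        du = kff + K * dx
        new_us.append(us[k] + du)
        dx = a * dx + b * du
        new_xs.append(xs[k + 1] + dx)
    return new_xs, new_us, P  # P is now the cost-to-go Hessian at stage 0


# Hypothetical problem data and a stand-in for the learned policy.
a, b, q, r, q_term, horizon, x0 = 1.1, 0.5, 1.0, 0.1, 1.0, 20, 1.0
learned_policy = lambda x: -0.8 * x

xs, us = rollout(x0, learned_policy, a, b, horizon)
xs_ref, us_ref, P0 = riccati_newton_step(xs, us, a, b, q, r, q_term)

nominal_cost = cost(xs, us, q, r, q_term)
refined_cost = cost(xs_ref, us_ref, q, r, q_term)
optimal_cost = 0.5 * P0 * x0 ** 2
print(nominal_cost, refined_cost, optimal_cost)
```

In this linear-quadratic special case the refined cost coincides with the optimal cost up to rounding. That exactness should not be expected for the nonlinear, constrained quadcopter problem, where the paper's claim is a quadratic reduction rather than convergence in one step.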

If this is right

  • The refined policy achieves performance close to a fully converged MPC solution on the tested task.
  • The method requires roughly half the computational time of full MPC while staying close in cost.
  • The quadratic error reduction holds when the learned rollout's approximation error is sufficiently small.
  • The controller can be deployed online at runtime using only the learned policy plus one Riccati solve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems with reliable models could use coarse learned policies and still reach near-optimal closed-loop behavior after the single correction.
  • The same one-step structure might be applied to other iterative solvers where the first update already captures most of the gain.
  • Pre-factoring the Riccati equations offline could further reduce online latency on embedded hardware.

Load-bearing premise

The system model is known and accurate enough that a single Riccati-based Newton step on the learned rollout realizes the quadratic reduction in suboptimality.

What would settle it

In the quadcopter trajectory-tracking simulation, measure suboptimality after the Newton step across a range of learned-policy errors; if the reduction is not quadratic or the refined cost does not approach the fully converged MPC cost, the central claim fails.
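
A toy version of this test can be sketched as follows. It is only an illustration of the diagnostic, not the paper's experiment: the scalar objective g(u) = exp(u) - u, the error values, and the function names are all assumptions chosen for this sketch, and the quantity measured is the distance to the minimizer after one update rather than the closed-loop suboptimality the paper reports. The sketch fits the slope of the post-update error against the initial error on a log-log scale, once for a Newton step and once for a fixed-step gradient update as a linear-rate comparison.

```python
import math


def newton_step(u):
    """One Newton step for minimizing g(u) = exp(u) - u (minimizer u* = 0)."""
    grad = math.expm1(u)  # g'(u) = exp(u) - 1, computed accurately near 0
    hess = math.exp(u)    # g''(u) = exp(u) > 0
    return u - grad / hess


def gradient_step(u, step=0.5):
    """One fixed-step gradient update on the same g, for comparison."""
    return u - step * math.expm1(u)


def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Hypothetical initial errors, playing the role of the learned-policy error.
errors_before = [0.1, 0.05, 0.02, 0.01, 0.005]
newton_after = [abs(newton_step(e)) for e in errors_before]
gradient_after = [abs(gradient_step(e)) for e in errors_before]

newton_slope = loglog_slope(errors_before, newton_after)
gradient_slope = loglog_slope(errors_before, gradient_after)
print(f"Newton slope:   {newton_slope:.2f}")
print(f"gradient slope: {gradient_slope:.2f}")
```

A fitted slope close to 2 indicates quadratic dependence on the initial error, while a slope close to 1 indicates a merely linear reduction. Applied to the quadcopter data, the same fit would use the measured policy error and the measured suboptimality in place of the toy quantities.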

Figures

Figures reproduced from arXiv: 2504.02710 by Alberto Bemporad, Andrea Ghezzi, Katrin Baumgärtner, Moritz Diehl, Rudolf Reiter.

Figure 1
Figure 1: Position and trajectory in the xz-plane obtained with different controllers for the easy task, where the lemniscate is scaled with α = 0.8. The black dotted lines are the reference, while the dashed red lines are the bounds on velocity. T = 4.5 seconds and frequency ρ = 2π/T, defined as [p_{x,o}(t) p_{y,o}(t) p_{z,o}(t) v_{x,o}(t) v_{y,o}(t) v_{z,o}(t)] = [α sin(ρt) 1/4 − 1/2 (1 + α sin(ρt…
Figure 2
Figure 2: Easy task - Breakdown of the average closed-loop cost for the considered approaches achieved in 100 episodes.

Table II: Easy task - Statistics of mean runtime and solver failures

Approach          Runtime (ms)   Feedback time (ms)   Solver failures (%)
RTI-40            2.28           1.44                 7
RTI-20            1.24           0.77                 0
PT-10-20-4        1.01           0.39                 10
CLC-10-20-1       5.08           0.46                 5
CLC-10-20-2       3.32           0.37                 5
PEPT-10-20-1      1.32           0.51                 0
PEPT-10-20-2      1.18           0.45                 0
PEPT-10-20…
Figure 3
Figure 3: Hard task - Breakdown of the average closed-loop cost for the considered approaches achieved in 100 episodes.

Table III: Hard task - Statistics of mean runtime and solver failures

Approach          Runtime (ms)   Feedback time (ms)   Solver failures (%)
RTI-40            2.40           1.54                 5
RTI-20            1.12           0.71                 0
PT-10-20-4        1.00           0.40                 12
CLC-10-20-1       3.75           0.49                 0
PEPT-10-20-1      1.34           0.52                 1
PEPT-10-20-1-r    1.29           0.46                 14
PPO-RL            0.19           -                    -
Original abstract

We propose a computationally efficient rollout-then-optimize method to improve a learned control policy at deployment time. A learned policy provides a nominal trajectory, which is refined online by a single Newton step implemented via a Riccati recursion within a model predictive control (MPC) scheme. This refinement combines model knowledge with the learned policy at minimal additional computational cost. We establish bounds on the approximation error of the learned policy relative to the MPC policy and show that one Newton step reduces the suboptimality of the learned rollout quadratically in the policy approximation error. The proposed controller is validated in simulation on a constrained trajectory-tracking task for a quadcopter with nonlinear dynamics. Results highlight that the Newton step significantly improves the learned policy, achieving performance close to a fully converged MPC solution while requiring roughly half of the computational time. The code is available at https://github.com/aghezz1/rl-riccati.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a rollout-then-optimize scheme for nonlinear MPC in which a learned policy supplies a nominal trajectory that is refined online by a single Newton step realized via Riccati recursion. The authors derive bounds on the approximation error between the learned policy and the MPC solution and claim that this one-step refinement reduces suboptimality quadratically in the policy error. The approach is tested in simulation on a constrained quadcopter trajectory-tracking problem with nonlinear dynamics, where the refined controller approaches the performance of a fully converged MPC solver while using roughly half the computation time. Reproducible code is provided.

Significance. If the local quadratic-reduction claim is placed on a firm footing, the work supplies a practical, low-overhead mechanism for injecting model knowledge into learned policies at deployment time. The combination of an explicit error bound with a Riccati-based Newton step and the open-source implementation are concrete strengths that would be of interest to the MPC and learning-based control communities.

major comments (2)
  1. [§3] §3 (theoretical results): The manuscript establishes an a-priori bound on the policy approximation error and asserts quadratic reduction of suboptimality after one Newton step. However, quadratic convergence of Newton’s method on the constrained nonlinear program requires the initial rollout to lie inside the basin of attraction whose radius depends on the Lipschitz constant of the Hessian and the constraint qualification at the optimum. The paper does not show that the derived error bound is smaller than this (problem-dependent) radius, nor does it verify the condition numerically for the quadcopter dynamics and state/input constraints.
  2. [§4] §4 (quadcopter example): The simulation results report that the Newton-refined policy achieves performance “close to” fully converged MPC. Because the quadratic-reduction guarantee is local, it is necessary to report the actual policy error norm and to confirm that this error lies inside the estimated basin of attraction for the chosen horizon and constraint set; without such a check the observed improvement could be linear rather than quadratic.
minor comments (2)
  1. Notation: the symbol used for the learned policy should be introduced once and used consistently; occasional reuse of the same letter for the MPC policy creates ambiguity in the error-bound statements.
  2. Figure 3: the caption should state the exact number of Newton iterations performed by the baseline MPC solver so that the reported timing comparison is unambiguous.
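
The basin-of-attraction condition raised in the first major comment can be made concrete with the standard local estimate for Newton's method, as found for example in Nocedal and Wright's Numerical Optimization. The statement below is the unconstrained version and is offered only as a reference point: the symbols z (stacked decision variables), f (objective), L, β, and r are notation introduced here, not the paper's, and the constrained case treated in the manuscript would need an analogous estimate in terms of the KKT matrix under suitable constraint qualifications.

```latex
% Assumptions: z^\star is a local minimizer with \nabla^2 f(z^\star) \succ 0,
% and \nabla^2 f is Lipschitz with constant L on the ball of radius r around z^\star.
\[
  \beta := \bigl\| \nabla^2 f(z^\star)^{-1} \bigr\|, \qquad
  r := \frac{1}{2 L \beta},
\]
\[
  \| z - z^\star \| \le r
  \;\Longrightarrow\;
  \| z^{+} - z^\star \| \le L \beta \, \| z - z^\star \|^{2},
  \qquad z^{+} = z - \nabla^2 f(z)^{-1} \nabla f(z).
\]
```

Inside this ball the step at least halves the error, since Lβ‖z − z⋆‖ ≤ 1/2. The referee's request amounts to checking that the learned rollout's error is below a radius of this kind for the quadcopter problem.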

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We agree that the local character of quadratic Newton convergence merits explicit verification and will strengthen the manuscript accordingly. Below we address each major comment.

Point-by-point responses
  1. Referee: [§3] §3 (theoretical results): The manuscript establishes an a-priori bound on the policy approximation error and asserts quadratic reduction of suboptimality after one Newton step. However, quadratic convergence of Newton’s method on the constrained nonlinear program requires the initial rollout to lie inside the basin of attraction whose radius depends on the Lipschitz constant of the Hessian and the constraint qualification at the optimum. The paper does not show that the derived error bound is smaller than this (problem-dependent) radius, nor does it verify the condition numerically for the quadcopter dynamics and state/input constraints.

    Authors: We acknowledge that the quadratic-reduction claim is local and that the manuscript does not explicitly compare the derived policy-error bound to the (problem-dependent) radius of the Newton basin. In the revision we will add a dedicated paragraph in §3 that recalls the standard basin-radius estimate from constrained Newton theory and then, for the quadcopter example, numerically evaluate both the observed policy-error norm and a conservative estimate of the basin radius (via the Lipschitz constant of the Hessian along the trajectory and the constraint qualification). This will either confirm that the error lies inside the basin or qualify the claim as “quadratic when the bound is smaller than the radius.” revision: yes

  2. Referee: [§4] §4 (quadcopter example): The simulation results report that the Newton-refined policy achieves performance “close to” fully converged MPC. Because the quadratic-reduction guarantee is local, it is necessary to report the actual policy error norm and to confirm that this error lies inside the estimated basin of attraction for the chosen horizon and constraint set; without such a check the observed improvement could be linear rather than quadratic.

    Authors: We agree that reporting only qualitative closeness is insufficient once locality is emphasized. In the revised §4 we will (i) tabulate the policy-error norm (in the appropriate norm) for each tested initial condition, (ii) provide the numerical basin-radius estimate obtained from the same data, and (iii) add a short convergence-rate diagnostic (log-log plot of suboptimality versus policy error) that visually distinguishes linear from quadratic reduction. These additions will be accompanied by the corresponding code in the repository. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external learned policy and standard Newton analysis

full rationale

The paper treats the learned policy as an independent external input that supplies a nominal trajectory. The one-step Newton refinement is implemented via Riccati recursion inside an MPC scheme whose convergence properties are analyzed using standard local quadratic convergence results for Newton's method under the assumption that the initial error lies inside the basin of attraction. No equation reduces the claimed error bound or quadratic suboptimality reduction to a fitted parameter or to a self-citation chain; the bounds are derived from the problem data and the learned policy's approximation error, which is not defined by the refinement itself. The quadcopter validation uses simulation data independent of the theoretical derivation. This satisfies the criteria for a self-contained, non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or non-standard axioms are stated in the provided text.

axioms (1)
  • domain assumption System dynamics permit implementation of a Newton step via Riccati recursion inside MPC
    Stated in the abstract as the computational mechanism for the refinement.


