GATO: GPU-Accelerated and Batched Trajectory Optimization for Scalable Edge Model Predictive Control
Pith reviewed 2026-05-18 08:29 UTC · model grok-4.3
The pith
GATO delivers real-time batched nonlinear trajectory optimization on GPU for moderate batch sizes in model predictive control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GATO is an open-source GPU-accelerated batched trajectory optimization solver that combines block-, warp-, and thread-level parallelism to achieve real-time throughput for moderate batch sizes of nonlinear solves. It reports speedups of 18-21x over CPU baselines and 1.4-16x over other GPU baselines as batch size grows, together with better disturbance rejection and convergence, and is validated on physical hardware.
What carries the argument
Multi-level (block, warp, thread) parallelism applied within and across solves in a batched nonlinear trajectory optimization framework.
If this is right
- Real-time model predictive control becomes practical on edge hardware for tasks that require simultaneous optimization of tens to low hundreds of trajectories.
- Solver throughput improves with larger batch sizes, enabling better scalability in applications that benefit from multiple parallel plans.
- Improved disturbance rejection and convergence rates are observed in simulated and hardware case studies.
- The open-source release allows direct reproduction and integration into existing robotics control stacks.
Where Pith is reading between the lines
- The same multi-level parallelism strategy could be adapted to other batch optimization problems in robotics such as motion planning or parameter estimation.
- Faster per-batch solve times may allow MPC to operate at higher replanning frequencies or with more complex dynamics models on the same hardware.
- Energy use on embedded platforms could decrease because shorter computation windows leave more time in low-power states.
Load-bearing premise
That combining block-, warp-, and thread-level parallelism on the GPU produces no prohibitive synchronization costs and preserves generality for nonlinear problems at moderate batch sizes.
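The three levels named in the premise can be made concrete with a small index calculation. The following is a minimal sketch in plain Python, not code from GATO: the function `locate_thread` is hypothetical, and assigning one thread block to each solve is an assumption made only for illustration. The warp size of 32 lanes is the standard value on current NVIDIA GPUs.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def locate_thread(global_thread_id, threads_per_block):
    """Split a flat thread index into (block, warp-in-block, lane-in-warp)."""
    if threads_per_block % WARP_SIZE != 0:
        raise ValueError("threads_per_block should be a multiple of the warp size")
    # Assumption for illustration: each block works on one solve of the batch.
    block = global_thread_id // threads_per_block
    thread_in_block = global_thread_id % threads_per_block
    warp = thread_in_block // WARP_SIZE   # warp index inside the block
    lane = thread_in_block % WARP_SIZE    # lane index inside the warp
    return block, warp, lane


print(locate_thread(300, threads_per_block=128))  # (2, 1, 12)
```

Synchronization cost grows with the level: lanes in a warp can exchange registers directly, warps in a block need a block-wide barrier, and blocks generally cannot synchronize inside a kernel without extra machinery. That is why the premise hinges on how work is split across the three levels.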
What would settle it
A benchmark run at batch sizes of 50-200 where GATO either falls below real-time rates, shows no speedup over a tuned CPU solver, or loses solution accuracy for the same nonlinear models.
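The three falsifying outcomes above can be written as a small pass/fail check. This is a minimal sketch under stated assumptions, not part of the paper or of GATO: the names `BenchmarkRun` and `failure_reasons`, the control period, and the relative cost tolerance are hypothetical placeholders, and the numbers in the usage lines are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """One measured benchmark point (all fields are hypothetical inputs)."""
    batch_size: int
    gpu_solve_time_s: float  # wall-clock time for the whole batch on the GPU solver
    cpu_solve_time_s: float  # wall-clock time for the same batch on a tuned CPU solver
    gpu_cost: float          # final objective value reached by the GPU solver
    cpu_cost: float          # final objective value reached by the CPU solver


def failure_reasons(run, control_period_s, rel_cost_tol=1e-2):
    """Return the falsifying conditions triggered by one run (empty list if none)."""
    reasons = []
    if not 50 <= run.batch_size <= 200:
        return reasons  # outside the batch-size range named in the text
    if run.gpu_solve_time_s > control_period_s:
        reasons.append("below real-time rate")
    if run.gpu_solve_time_s >= run.cpu_solve_time_s:
        reasons.append("no speedup over CPU")
    # Accuracy loss: GPU cost exceeds CPU cost by more than the tolerance.
    # Assumes positive costs so the relative comparison is meaningful.
    if run.gpu_cost > run.cpu_cost * (1.0 + rel_cost_tol):
        reasons.append("lost solution accuracy")
    return reasons


# Illustrative numbers only; they are not measurements from the paper.
ok_run = BenchmarkRun(128, 0.004, 0.080, 10.02, 10.00)
bad_run = BenchmarkRun(128, 0.030, 0.020, 12.00, 10.00)
print(failure_reasons(ok_run, control_period_s=0.010))
print(failure_reasons(bad_run, control_period_s=0.010))
```

Any one non-empty result at a batch size in the 50-200 range would count against the claim, so the check returns every triggered condition rather than stopping at the first.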
Original abstract
While Model Predictive Control (MPC) delivers strong performance across robotics applications, solving the underlying (batches of) nonlinear trajectory optimization (TO) problems online remains computationally demanding. Existing GPU-accelerated approaches either parallelize single solves, handle large batches at sub-real-time rates, or sacrifice model generality for speed. This leaves a large gap in solver performance for many state-of-the-art MPC applications that require real-time batches of tens to low-hundreds of solves. As such, we present GATO, an open source, GPU-accelerated, batched TO solver co-designed across algorithm, software, and computational hardware to deliver real-time throughput for these moderate batch size regimes. Our approach leverages a combination of block-, warp-, and thread-level parallelism within and across solves for ultra-high performance. We demonstrate the effectiveness of our approach through a combination of: simulated benchmarks showing speedups of 18-21x over CPU baselines and 1.4-16x over GPU baselines as batch size increases; case studies highlighting improved disturbance rejection and convergence behavior; and finally a validation on hardware using an industrial manipulator. We open source GATO to support reproducibility and adoption.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GATO, an open-source GPU-accelerated batched trajectory optimization solver for model predictive control targeting moderate batch sizes of tens to low hundreds of solves. It claims to fill a performance gap by co-designing algorithm, software, and hardware with combined block-, warp-, and thread-level parallelism, reporting empirical speedups of 18-21x over CPU baselines and 1.4-16x over other GPU baselines, along with case studies on disturbance rejection and hardware validation on an industrial manipulator.
Significance. If the throughput claims hold with adequate analysis of overheads, the work would meaningfully advance real-time nonlinear MPC on edge hardware by supporting moderate batch regimes without sacrificing model generality. The open-sourcing for reproducibility and the inclusion of hardware experiments are clear strengths that enhance the practical value of the contribution.
major comments (1)
- Abstract: the central performance claims (18-21x CPU and 1.4-16x GPU speedups for moderate batch sizes) rest on the assumption that block-, warp-, and thread-level parallelism can be combined without prohibitive synchronization overhead or loss of generality for nonlinear TO. The abstract does not detail how data-dependent operations such as dynamics linearization or line search are scheduled to avoid warp divergence and cross-warp barriers in this batch-size regime, which is load-bearing for the real-time throughput assertion.
minor comments (2)
- The description of the GPU baselines would benefit from explicit statement of their batch-size scaling behavior and whether they also target moderate regimes, to strengthen the comparative claims.
- Consider adding a short table or paragraph summarizing the specific robot models, horizon lengths, and constraint types used in the simulated benchmarks to improve reproducibility.
Simulated Authors' Rebuttal
We thank the referee for the constructive review and for recognizing the practical value of GATO, including the open-source release and hardware validation. We address the single major comment below and outline a targeted revision to the manuscript.
- Referee: Abstract: the central performance claims (18-21x CPU and 1.4-16x GPU speedups for moderate batch sizes) rest on the assumption that block-, warp-, and thread-level parallelism can be combined without prohibitive synchronization overhead or loss of generality for nonlinear TO. The abstract does not detail how data-dependent operations such as dynamics linearization or line search are scheduled to avoid warp divergence and cross-warp barriers in this batch-size regime, which is load-bearing for the real-time throughput assertion.
  Authors: We appreciate the referee highlighting the need for greater clarity on this point. The abstract is intentionally high-level to summarize the contribution and results. The detailed co-design of block-, warp-, and thread-level parallelism, along with the scheduling of data-dependent operations (dynamics linearization, line search, etc.) to control warp divergence and synchronization costs, is described in Sections 3 and 4 of the manuscript. Our implementation uses uniform batch processing, warp-level primitives for reductions, and kernel structures that minimize cross-warp barriers while preserving full nonlinear model generality. The reported speedups are measured end-to-end and already incorporate all overheads, as shown in the scaling benchmarks of Section 5. To make the abstract more self-contained and directly address the referee's concern, we will revise it to include a concise statement on the scheduling approach for data-dependent operations.
  Revision: yes
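The "warp-level primitives for reductions" mentioned in the response usually refer to a shuffle-based tree reduction. The following plain-Python emulation is a hedged illustration of that generic pattern only; it is not GATO's kernel code, and the function name `warp_reduce_sum` is hypothetical. On a GPU each step would be one register exchange between lanes (for example CUDA's `__shfl_down_sync`), with no shared memory and no block-wide barrier.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def warp_reduce_sum(lane_values):
    """Emulate a shuffle-down tree reduction over the lanes of one warp."""
    if len(lane_values) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane reads the value held by the lane `offset` positions above it.
        # Lanes whose partner is out of range add zero in this emulation; on
        # hardware they would read their own value, which never reaches lane 0.
        shifted = [vals[i + offset] if i + offset < WARP_SIZE else 0.0
                   for i in range(WARP_SIZE)]
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # lane 0 holds the full sum after log2(32) = 5 steps


print(warp_reduce_sum([float(i) for i in range(WARP_SIZE)]))  # 496.0
```

The point of the pattern is that a 32-element sum finishes in five exchange steps that stay inside one warp, so a reduction such as a dot product only needs a cross-warp barrier when partial sums from several warps must be combined.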
Circularity Check
No significant circularity in empirical performance claims
The paper presents GATO as an implemented co-designed GPU solver for batched nonlinear trajectory optimization and supports its real-time throughput claims exclusively through direct empirical benchmarks (speedups of 18-21x over CPU and 1.4-16x over other GPU baselines) plus hardware validation on an industrial manipulator. These are measured outcomes from simulated and physical tests rather than any derived predictions, first-principles results, or fitted parameters that reduce to the inputs by construction. No equations, uniqueness theorems, ansatzes, or self-citations are invoked as load-bearing steps in the provided claims; the central argument rests on the observed behavior of the block/warp/thread parallelism implementation itself, which is externally falsifiable via the open-sourced code and independent re-runs. The derivation chain is therefore self-contained in the engineering and benchmarking methodology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nonlinear trajectory optimization problems in MPC can be solved reliably with standard numerical methods when sufficient compute is available.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Our approach leverages a combination of block-, warp-, and thread-level parallelism within and across solves for ultra-high performance... batched PCG... symmetric stair preconditioner
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: GATO... real-time throughput for moderate batch sizes of tens to low-hundreds of solves
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Vectorizing Projection in Manifold-Constrained Motion Planning for Real-Time Whole-Body Control: Vectorizing projection operations enables real-time manifold-constrained motion planning for humanoid robots with 100-1000x speedups over prior methods.