pith. sign in

arxiv: 2605.31481 · v1 · pith:WDTZ6A4Fnew · submitted 2026-05-29 · 💻 cs.RO

Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning

Pith reviewed 2026-06-28 22:18 UTC · model grok-4.3

classification 💻 cs.RO
keywords batched rigid body dynamicsdifferentiable dynamicsGPU accelerationPyTorchrobot learningreinforcement learningforward kinematicssystem identification
0
0 comments X

The pith

BARD delivers batched rigid-body dynamics in PyTorch that match Pinocchio accuracy while running up to 64 times faster on GPU for large robot batches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BARD as a PyTorch library that implements Featherstone's rigid-body dynamics algorithms with GPU batching and automatic differentiation in mind. It relies on a tiered lazy-evaluation cache, matmul-free joint transforms from pre-computed Rodrigues constants, and level-parallel propagation to cut down on sequential work. These choices let the library reach the same numerical results as the CPU library Pinocchio on forward kinematics and Jacobians across five robot models, yet deliver much higher throughput at batch sizes up to 4096. The work also shows the gradients are usable by recovering link masses from noisy torque data and by speeding up an entire reinforcement-learning training loop for a quadruped.

Core claim

BARD implements Featherstone's algorithms in PyTorch so that forward kinematics, Jacobians, and full dynamics can be evaluated in large batches on GPU; the three design choices keep results numerically identical to Pinocchio while producing up to 64x higher throughput for kinematics and 63x for Jacobians at batch size 4096, and the same gradients support system identification to 1.24 percent mean mass error under 5 percent noise as well as 8.5x faster in-loop dynamics inside an 11-DOF quadruped training run with 4096 environments.

What carries the argument

The tiered lazy-evaluation cache that skips redundant tree traversals together with matmul-free joint transforms via pre-computed Rodrigues constants and level-parallel propagation that collapses sequential steps to tree-depth batched operations.

If this is right

  • Robot reinforcement learning can run thousands of parallel environments with full dynamics inside the training loop on a single GPU.
  • Gradient-based system identification becomes practical without leaving the GPU.
  • Existing GPU training frameworks can replace CPU dynamics calls and obtain measured speedups of 2x to 8.5x.
  • Batch sizes of several thousand become usable for models between 7 and 23 degrees of freedom without accuracy loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cache and parallel-propagation pattern could be applied to other tree-structured kinematic or dynamic calculations.
  • Extending the implementation to contact-rich or floating-base models would test whether the speed advantage survives added complexity.
  • Users could measure whether the observed speedups remain when the same code runs on other GPU architectures or at even larger batch sizes.

Load-bearing premise

The caching, transform, and propagation choices keep numerical values and gradient computation correct for every robot model and every batch size that will be used.

What would settle it

Compute forward kinematics and its Jacobian for one of the 7-to-23-DOF test robots at batch size 4096 on GPU, then compare every output entry and every gradient entry against an independent high-precision CPU reference; any difference larger than machine epsilon would disprove the claim.

Figures

Figures reproduced from arXiv: 2605.31481 by Chuanhang Qiu, Wenbo Wu, Yanran Xu, Yue Wang, Zhaoxing Li.

Figure 1
Figure 1. Figure 1: Architecture of bard. A URDF is parsed once into an immutable model stor￾ing tree topology, spatial inertias, and pre-computed constants. All mutable state for B parallel environments resides in a pre-allocated workspace sized at initialisation. Downstream algorithms materialise only the kinematic quantities they require through a tiered lazy evaluation scheme. transforms, level-parallel tree propagation, … view at source ↗
Figure 2
Figure 2. Figure 2: Speedup over Pinocchio (PyTorch) at batch size 4096 on NVIDIA H200. bard consistently outperforms ADAM across all algorithms, with the largest gains in FK (64×) and Jacobian (63×). For dynamics algorithms the speedup is smaller but still consistent, ranging from 3× to 7× depending on the robot and algorithm [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Absolute wall-clock time vs. batch size for the Go2 (12-DOF) on NVIDIA H200. Pinocchio’s execution time scales linearly with batch size, while bard remains nearly flat up to batch size 4096. use hand-written Triton kernels sized for the 6 × 6 updates, avoiding this prob￾lem. In practice, we recommend eager mode for dynamics-heavy workloads on lower-bandwidth GPUs and compiled mode for kinematics-heavy work… view at source ↗
Figure 4
Figure 4. Figure 4: Wall-clock time vs. batch size on NVIDIA H100 80GB HBM3 (Go2, 12-DOF) [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Wall-clock time vs. batch size on NVIDIA A100 80GB PCIe (Go2, 12-DOF) [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Wall-clock time vs. batch size on NVIDIA L40S (Go2, 12-DOF) [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Wall-clock time vs. batch size on NVIDIA L4 (Go2, 12-DOF) [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: https://github.com/YueWang996/bard-pytorch-dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces BARD, a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms optimized for batched GPU evaluation and automatic differentiation. Key design choices include a tiered lazy-evaluation cache, matmul-free joint transforms using pre-computed Rodrigues constants, and level-parallel propagation. On five robot models (7-23 DOFs), it claims numerical equivalence to Pinocchio with up to 64x throughput for forward kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200; differentiability is validated via gradient-based system identification recovering link masses to 1.24% mean error under noise; and integration into an Isaac Lab AMP pipeline for an 11-DOF quadruped shows 8.5x speedup over Pinocchio and 2.0x over ADAM with 4096 environments.

Significance. If the optimizations preserve numerical accuracy to machine precision and produce correct gradients, the work would address a key throughput bottleneck in GPU-based robot learning pipelines, enabling larger-scale in-loop dynamics for reinforcement learning. The open-source release and concrete integration example add practical value.

major comments (2)
  1. [Abstract] Abstract: the claim that BARD 'matches Pinocchio numerically' for forward kinematics and Jacobians on five models is load-bearing for the central accuracy and speedup assertions, yet no quantitative support (maximum absolute errors, per-model residual tables, or error-bar reporting) is provided to confirm equivalence under the tiered cache and level-parallel reordering.
  2. [Abstract] Abstract: the differentiability validation via system identification (1.24% mean mass error) does not address whether the lazy-evaluation cache and level-parallel propagation preserve correct autograd tape construction and gradient values relative to a sequential reference implementation; no gradient-norm comparisons or per-operation checks are mentioned.
minor comments (1)
  1. The throughput numbers (64x, 63x, 8.5x) are reported without benchmark methodology details, hardware configuration beyond 'NVIDIA H200', or variance across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the quantitative support for our claims. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that BARD 'matches Pinocchio numerically' for forward kinematics and Jacobians on five models is load-bearing for the central accuracy and speedup assertions, yet no quantitative support (maximum absolute errors, per-model residual tables, or error-bar reporting) is provided to confirm equivalence under the tiered cache and level-parallel reordering.

    Authors: We agree that the abstract would be strengthened by explicit quantitative metrics. The full manuscript (Section 4.1 and associated tables) reports per-model maximum absolute errors for forward kinematics and Jacobians that remain below 1e-8 (within floating-point precision) when using the tiered cache and level-parallel propagation. We will revise the abstract to include a brief statement of these maximum errors and ensure clear cross-references to the supporting tables and figures. revision: yes

  2. Referee: [Abstract] Abstract: the differentiability validation via system identification (1.24% mean mass error) does not address whether the lazy-evaluation cache and level-parallel propagation preserve correct autograd tape construction and gradient values relative to a sequential reference implementation; no gradient-norm comparisons or per-operation checks are mentioned.

    Authors: The system-identification experiment demonstrates end-to-end correctness through accurate mass recovery, but we concur that direct verification of gradient fidelity under the proposed optimizations is warranted. We will add an appendix containing gradient-norm comparisons and per-operation checks against a sequential reference implementation, confirming that the lazy cache and level-parallel methods preserve both the autograd tape and numerical gradient values. These results will be incorporated into the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of GPU implementation against external reference library

full rationale

The paper implements known Featherstone rigid-body algorithms in PyTorch with batched GPU optimizations (lazy cache, Rodrigues constants, level-parallel propagation). All performance and accuracy claims rest on direct numerical comparison to the independent external library Pinocchio plus separate empirical tests (system ID recovering masses, integration into Isaac Lab). No self-referential equations, fitted inputs presented as predictions, self-citation load-bearing, or ansatz smuggling appear in the provided text. The central results are externally falsifiable and do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering implementation and optimization of established Featherstone algorithms using standard PyTorch primitives. No new free parameters, axioms, or invented physical entities are introduced.

pith-pipeline@v0.9.1-grok · 5798 in / 1342 out tokens · 39883 ms · 2026-06-28T22:18:49.376476+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 5 canonical work pages

  1. [1]

    In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Bledt, G., Powell, M.J., Katz, B., Di Carlo, J., Wensing, P.M., Kim, S.: Mit cheetah 3: Design and control of a robust, dynamic quadruped robot. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2245–

  2. [2]

    In: Robotics: Science and Systems (RSS) (2018)

    Carpentier, J., Mansard, N.: Analytical derivatives of rigid body dynamics algo- rithms. In: Robotics: Science and Systems (RSS) (2018)

  3. [3]

    In: IEEE Inter- national Symposium on System Integration (SII)

    Carpentier, J., Saurel, G., Buondonno, G., Mirabel, J., Lamiraux, F., Stasse, O., Mansard, N.: The Pinocchio C++ library: A fast and flexible implementation of rigid body dynamics algorithms and their analytical derivatives. In: IEEE Inter- national Symposium on System Integration (SII). pp. 614–619. IEEE (2019)

  4. [4]

    Frontiers in neurorobotics13, 406386 (2019)

    Degrave, J., Hermans, M., Dambre, J., et al.: A differentiable physics engine for deep learning in robotics. Frontiers in neurorobotics13, 406386 (2019)

  5. [5]

    Springer, New York (2008)

    Featherstone, R.: Rigid Body Dynamics Algorithms. Springer, New York (2008)

  6. [6]

    Autonomous Robots41(2), 495–511 (2017)

    Felis, M.L.: RBDL: An efficient rigid-body dynamics library using recursive algo- rithms. Autonomous Robots41(2), 495–511 (2017)

  7. [7]

    In: Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (2021)

    Freeman, C.D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., Bachem, O.: Brax – a differentiable physics engine for large scale rigid body simulation. In: Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (2021)

  8. [8]

    DiffTaichi: Differentiable programming for physical simulation

    Hu, Y., Anderson, L., Li, T.M., Sun, Q., Carr, N., Ragan-Kelley, J., Durand, F.: Difftaichi: Differentiable programming for physical simulation. arXiv preprint arXiv:1910.00935 (2019)

  9. [9]

    Science Robotics 4(26), eaau5872 (2019)

    Hwangbo,J.,Lee,J.,Dosovitskiy,A.,Bellicoso,D.,Tsounis,V.,Koltun,V.,Hutter, M.: Learning agile and dynamic motor skills for legged robots. Science Robotics 4(26), eaau5872 (2019)

  10. [10]

    IEEE Robotics and Automation Letters4(2), 1595–1602 (2019)

    Kashiri, N., Baccelliere, L., Muratore, L., Laurenzi, A., Ren, Z., Hoffman, E.M., Kamedula, M., Rigano, G.F., Malzahn, J., Cordasco, S., Guria, P., Margan, A., Tsagarakis, N.G.: Centauro: A hybrid locomotion and high power resilient manip- ulation platform. IEEE Robotics and Automation Letters4(2), 1595–1602 (2019)

  11. [11]

    IEEE Robotics and Automation Let- ters6(2), 3413–3420 (2021)

    Le Lidec, Q., Kalevatykh, I., Laptev, I., Schmid, C., Carpentier, J.: Differentiable simulation for physical system identification. IEEE Robotics and Automation Let- ters6(2), 3413–3420 (2021)

  12. [12]

    https://github.com/ami-iit/adam (2025)

    L’Erario, G., Traversaro, S., Sartore, C., Grieco, R., Dafarra, S., Romualdi, G., Fer- retti, F.L., Pucci, D.: ADAM: Automatic differentiation for rigid-body-dynamics algorithms and models. https://github.com/ami-iit/adam (2025)

  13. [13]

    In: International Conference on Principles and Practice of Multi-Agent Systems

    Li, Z., Wang, Y., Wu, W., Xu, Y., Stein, S.: Hmcf: A human-in-the-loop multi- robot collaboration framework based on large language models. In: International Conference on Principles and Practice of Multi-Agent Systems. pp. 150–167 (2025)

  14. [14]

    In: Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (2021)

    Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., State, G.: Isaac Gym: High perfor- mance GPU-based physics simulation for robot learning. In: Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (2021)

  15. [15]

    In: IEEE International Conference on Robotics and Automation (ICRA)

    Mastalli, C., Budhiraja, R., Merkt, W., Saurel, G., Hammoud, B., Naveau, M., Carpentier, J., Righetti, L., Vijayakumar, S., Mansard, N.: Crocoddyl: An efficient and versatile framework for multi-contact optimal control. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 2536–2542. IEEE (2020) 12 Wang et al

  16. [16]

    Autonomous Robots49(3), 19 (2025)

    Mattamala, M., Frey, J., Libera, P., Chebrolu, N., Martius, G., Cadena, C., Hutter, M., Fallon, M.: Wild visual navigation: Fast traversability learning via pre-trained models and online self-supervision. Autonomous Robots49(3), 19 (2025)

  17. [17]

    arXiv preprint arXiv:2202.11217 (2022)

    Meier, F., Wang, A., Sutanto, G., Lin, Y., Shah, P.: Differentiable and learnable robot models. arXiv preprint arXiv:2202.11217 (2022)

  18. [18]

    Advances in Neural Information Processing Sys- tems32(2019)

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high- performance deep learning library. Advances in Neural Information Processing Sys- tems32(2019)

  19. [19]

    In: ACM SIGGRAPH

    Peng,X.B.,Ma,Z.,Abbeel,P.,Levine,S.,Kanazawa,A.:AMP:Adversarialmotion priors for stylized physics-based character animation. In: ACM SIGGRAPH. pp. 1–15 (2021)

  20. [20]

    In: IEEE International Conference on Robotics and Automation (ICRA)

    Plancher,B.,Neuman,S.M.,Ghosal,R.,Kuindersma,S.,Reddi,V.J.:GRiD:GPU- accelerated rigid body dynamics with analytical gradients. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 3728–3734. IEEE (2022)

  21. [21]

    In: Conference on Robot Learning (CoRL)

    Rudin, N., Hoeller, D., Reist, P., Hutter, M.: Learning to walk in minutes using massively parallel deep reinforcement learning. In: Conference on Robot Learning (CoRL). pp. 91–100. PMLR (2021)

  22. [22]

    In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Sathyamoorthy, A.J., Weerakoon, K., Guan, T., Russell, M., Conover, D., Pusey, J., Manocha, D.: Vern: Vegetation-aware robot navigation in dense unstructured outdoor environments. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 11233–11240. IEEE (2023)

  23. [23]

    https://drake.mit.edu (2019)

    Tedrake, R., the Drake Development Team: Drake: Model-based design and verifi- cation for robotics. https://drake.mit.edu (2019)

  24. [24]

    In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages

    Tillet, P., Kung, H.T., Cox, D.: Triton: an intermediate language and compiler for tiled neural network computations. In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. pp. 10–19 (2019)

  25. [25]

    arXiv preprint arXiv:2510.01984 (2025)

    Wang, Y.: SPARC: Spine with prismatic and revolute compliance for quadruped robot. arXiv preprint arXiv:2510.01984 (2025)

  26. [26]

    In: 2025 IEEE 64th Conference on Decision and Control (CDC)

    Wang, Y., Wang, H., Li, Z.: Quattro: transformer-accelerated iterative linear quadratic regulator framework for fast trajectory optimization. In: 2025 IEEE 64th Conference on Decision and Control (CDC). pp. 6629–6636. IEEE (2025)

  27. [27]

    arXiv preprint arXiv:2103.16021 (2021)

    Werling, K., Omens, D., Lee, J., Exarchos, I., Liu, C.K.: Fast and feature-complete differentiable physics for articulated rigid bodies with contact. arXiv preprint arXiv:2103.16021 (2021)

  28. [28]

    In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Xu, Y., Zauner, K.P., Tarapore, D.: Blind-wayfarer: A minimalist, probing-driven framework for resilient navigation in perception-degraded environments. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 19340–19345. IEEE (2025)

  29. [29]

    Mujoco playground.arXiv preprint arXiv:2502.08844,

    Zakka, K., Tabanpour, B., Liao, Q., Haiderbhai, M., Holt, S., Luo, J.Y., Allshire, A., Frey, E., Sreenath, K., Kahrs, L.A., et al.: MuJoCo Playground. arXiv preprint arXiv:2502.08844 (2025)

  30. [30]

    https://github.com/UM-ARM-Lab/pytorch_ kinematics (2024)

    Zhong, S., Power, T., Gupta, A., Mitrano, P.: pytorch_kinematics: A PyTorch library for differentiable kinematics. https://github.com/UM-ARM-Lab/pytorch_ kinematics (2024)

  31. [31]

    Science robotics7(66), eabm5954 (2022)

    Zhou, X., Wen, X., Wang, Z., Gao, Y., Li, H., Wang, Q., Yang, T., Lu, H., Cao, Y., Xu, C., et al.: Swarm of micro flying robots in the wild. Science robotics7(66), eabm5954 (2022)

  32. [32]

    Zhuang,Z.,Yao,S.,Zhao,H.:Humanoidparkourlearning.In:ConferenceonRobot Learning. pp. 1975–1991. PMLR (2025) Batched Differentiable Rigid Body Dynamics in PyTorch 13 Supplementary Material ThisappendixprovidesextendedbenchmarkresultsforbardacrossfourNVIDIA GPUs: H100 80GB HBM3, A100 80GB PCIe, L40S, and L4. All timings use the same protocol as the main pape...