pith. machine review for the scientific record.

arxiv: 2605.03363 · v1 · submitted 2026-05-05 · 💻 cs.RO · cs.SY · eess.SY


Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control


Pith reviewed 2026-05-07 15:52 UTC · model grok-4.3

classification 💻 cs.RO · cs.SY · eess.SY
keywords dexterous grasping · hierarchical control · reinforcement learning · quadratic programming · sim-to-real transfer · reactive manipulation · multi-agent RL · task-space planning

The pith

A hybrid multi-agent RL planner and QP controller decouples high-level task intent from low-level joint execution to enable reactive dexterous grasping with zero-shot steerability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a hierarchical control system that splits the problem into two layers: reinforcement learning agents that output desired velocities in task space for the arm and hand separately, and a quadratic programming solver that converts those velocities into safe joint commands while respecting limits and avoiding collisions. This split speeds up training and builds safety directly into the execution layer. The result is a policy that can be steered at runtime by changing safety parameters or adding obstacle avoidance without any retraining, and that transfers from simulation to a real 7-DoF arm with a 20-DoF hand, grasping unseen objects while recovering from pushes and other disturbances.

Core claim

By training separate arm and hand reinforcement learning agents to produce task-space velocity commands and then feeding those commands into a GPU-parallelized quadratic programming controller that enforces kinematic limits and collision constraints, the framework achieves both faster policy learning and the ability to adjust safety margins or avoid dynamic obstacles at runtime without retraining.

What carries the argument

Multi-agent RL high-level planner that outputs task-space velocities, processed by a GPU-parallelized quadratic programming low-level controller that maps them to feasible joint velocities while enforcing safety constraints.
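
To make the division of labor concrete, the sketch below shows the kind of differential-inverse-kinematics QP the low-level layer solves at each control step: track the planner's task-space velocity subject to joint-velocity limits and one linearized collision constraint. This is an editorial illustration under assumed interfaces, not the paper's implementation; it uses OSQP (a solver the paper's reference list includes), and the Jacobian J, command v_des, limits, and obstacle terms are assumed to be supplied by the surrounding stack.

```python
# Editorial sketch of a per-step differential-IK QP, not the paper's code.
# Tracks a task-space velocity command from the RL planner subject to
# joint-velocity limits and one linearized collision ("velocity damper")
# constraint. J, v_des, the limits, and the obstacle terms (J_p, n_hat,
# dist) are assumed inputs from the rest of the stack.
import numpy as np
import scipy.sparse as sp
import osqp

def solve_joint_velocities(J, v_des, qd_min, qd_max,
                           J_p=None, n_hat=None, dist=None,
                           margin=0.05, gain=5.0, reg=1e-3):
    # min_qd ||J qd - v_des||^2 + reg * ||qd||^2
    # s.t.   qd_min <= qd <= qd_max
    #        n_hat^T J_p qd >= -gain * (dist - margin)
    n = J.shape[1]
    P = sp.csc_matrix(J.T @ J + reg * np.eye(n))
    q = -J.T @ v_des

    rows = [sp.eye(n, format="csc")]          # joint-velocity box
    lo, hi = [qd_min], [qd_max]
    if J_p is not None:                       # optional collision row
        rows.append(sp.csc_matrix((n_hat @ J_p).reshape(1, n)))
        lo.append(np.array([-gain * (dist - margin)]))
        hi.append(np.array([np.inf]))

    prob = osqp.OSQP()
    prob.setup(P=P, q=q, A=sp.vstack(rows, format="csc"),
               l=np.concatenate(lo), u=np.concatenate(hi), verbose=False)
    res = prob.solve()
    return res.x if res.info.status == "solved" else None
```

In the paper this solve is GPU-parallelized across environments; the CPU sketch above only shows the per-step structure that makes the safety constraints hard rather than learned.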

If this is right

  • Training convergence accelerates because the RL agents only learn task-space behavior while the QP layer handles all joint-level constraints.
  • Hardware safety is strictly enforced at every time step regardless of what the RL policy outputs.
  • System operators can change collision avoidance margins or add new obstacles dynamically without retraining (see the sketch after this list).
  • The same policy transfers zero-shot to real hardware and recovers from unexpected physical disturbances on diverse unseen objects.
  • The architecture isolates high-level spatial intent from low-level execution, allowing independent development or tuning of each layer.
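
Because the steerability claim amounts to editing the QP's constraint data, the knob-turning itself is small. A hypothetical sketch, reusing the problem layout from the QP example above (OSQP supports in-place bound updates; names and defaults here are illustrative):

```python
# Hypothetical operator-side steering: scale the joint-velocity box and
# tighten the collision margin, then push the new bounds into the
# already-built OSQP problem. The RL policy and the QP matrices stay
# untouched; the change takes effect on the next solve().
import numpy as np

def steer(prob, qd_min, qd_max, dist, vel_scale=0.4, margin=0.10, gain=5.0):
    l_new = np.concatenate([vel_scale * qd_min, [-gain * (dist - margin)]])
    u_new = np.concatenate([vel_scale * qd_max, [np.inf]])
    prob.update(l=l_new, u=u_new)
```

This mirrors the post-training joint-velocity limit scaling shown in Figure 15; whether the QP remains feasible under such edits is exactly the referee's first major comment below.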

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of task-space planning from joint-space enforcement could apply to other contact-rich manipulation skills such as in-hand reorientation or tool use.
  • Adding online perception to the high-level planner might allow the system to react to moving targets or changing object properties without altering the QP layer.
  • The explicit safety layer may reduce the amount of reward shaping needed during RL training compared with end-to-end joint-space policies.

Load-bearing premise

The simulation environment matches real-world contact dynamics closely enough that the learned velocity commands remain feasible for the QP solver in real time.

What would settle it

A real-world trial in which the QP solver returns infeasible solutions or the robot collides or drops objects under the same velocity commands that worked in simulation.

Figures

Figures reproduced from arXiv: 2605.03363 by Alexander Alexiev, Ho Jae Lee, Sangbae Kim, Se Hwan Jeon, Tzu-Yuan Lin, Yonghyeon Lee.

Figure 1: Overview of the proposed multi-agent RL framework with hardware …
Figure 2: Overview of the hybrid hierarchical control framework. Our architecture consists of a high-level RL planner operating at 100 Hz and a low-level QP …
Figure 3: Illustration of the fixed-base manipulation platforms and coordinate …
Figure 4: Simulation results demonstrating the reach-grasp-lift progression of the proposed framework. The top two rows show the 20-DoF 5F hand grasping …
Figure 7: Experimental setup for real-world hardware validation. We evaluated …
Figure 6: Per-object grasp success rates for the 5F hand, highlighting the five …
Figure 8: Hardware experiments demonstrating real-world grasping with the 5F hand on previously unseen objects. (a) Grasping a cup demonstrates the …
Figure 9: Comparison of control architectures and corresponding learning objectives. Unlike the …
Figure 10: Training performance comparison of the proposed framework against …
Figure 12: Spatial distribution of the palm velocity tracking error across the …
Figure 13: Design of the APF for online task-space velocity modulation. To …
Figure 14: By superimposing the APF-generated repulsive velocity ( …
Figure 15: Post-training steerability via joint velocity limit scaling. Reducing …
read the original abstract

In this work, we propose a hybrid hierarchical control framework for reactive dexterous grasping that explicitly decouples high-level spatial intent from low-level joint execution. We introduce a multi-agent reinforcement learning architecture, specialized into distinct arm and hand agents, that acts as a high-level planner by generating desired task-space velocity commands. These commands are then processed by a GPU-parallelized quadratic programming controller, which translates them into feasible joint velocities while strictly enforcing kinematic limits and collision avoidance. This structural isolation not only accelerates training convergence but also strictly enforces hardware safety. Furthermore, the architecture unlocks zero-shot steerability, allowing system operators to dynamically adjust safety margins and avoid dynamic obstacles without retraining the policy. We extensively validate the proposed framework through a rigorous simulation-to-reality pipeline. Real-world hardware experiments on a 7-DoF arm equipped with a 20-DoF anthropomorphic hand demonstrate highly robust zero-shot transferability for dexterous grasping to a diverse set of unseen objects, highlighting the system's ability to reactively recover from unexpected physical disturbances in unstructured environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a hybrid hierarchical framework for reactive dexterous grasping that decouples high-level task-space velocity planning (via multi-agent RL with separate arm and hand agents) from low-level joint-space execution (via a GPU-parallelized QP controller enforcing kinematic limits and collision avoidance). It claims this structure accelerates RL training, strictly enforces hardware safety, and enables zero-shot steerability: operators can dynamically tighten safety margins or introduce dynamic obstacles at runtime without retraining the policy. The approach is validated via a sim-to-real pipeline on a 7-DoF arm with 20-DoF anthropomorphic hand, demonstrating robust grasping of unseen objects and recovery from physical disturbances in unstructured environments.

Significance. If the empirical claims hold, the work offers a practical route to combining RL adaptability with optimization-based safety in dexterous manipulation. The explicit decoupling and resulting steerability could reduce the need for retraining when deployment conditions change, which is valuable for real-world robotics. The sim-to-real focus and hardware validation on a high-DoF system are positive, though the absence of detailed quantitative metrics limits immediate assessment of impact.

major comments (2)
  1. [abstract and §4 (experiments)] The central claim of zero-shot steerability (abstract and §4) rests on the QP controller remaining feasible and real-time solvable when safety margins are tightened or dynamic obstacles are added at runtime. However, the RL agents are trained exclusively under the nominal constraint set; no regularization, adversarial training, or post-training analysis is described that ensures the QP feasible set remains non-empty under operator-induced changes. This is load-bearing for the steerability result.
  2. [§5 and abstract] §5 (or equivalent results section) and the abstract assert successful sim-to-real transfer and disturbance recovery, yet no quantitative metrics (success rates, recovery times, failure rates), ablation studies on the hierarchical RL+QP split, or baseline comparisons are referenced. Without these, the robustness claims cannot be evaluated and the soundness of the sim-to-real pipeline remains unverified.
minor comments (2)
  1. [§3] Notation for the task-space velocity commands generated by the RL agents versus the QP inputs should be clarified in §3 to avoid ambiguity between planning and control layers.
  2. [§3] The description of the multi-agent RL architecture would benefit from an explicit diagram or pseudocode showing the information flow between arm and hand agents and the shared task-space output (a hypothetical sketch follows).
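
The information flow the referee asks for can at least be sketched from the abstract alone. A hypothetical rendering, with all interfaces and shapes as illustrative assumptions rather than the paper's specification:

```python
import numpy as np

# Hypothetical per-tick information flow implied by the abstract: two
# specialized policies emit task-space velocity commands, and the QP
# layer maps them to feasible joint velocities.
def control_step(obs, arm_policy, hand_policy, qp_controller):
    v_arm = arm_policy(obs)                    # e.g. desired palm twist, shape (6,)
    v_hand = hand_policy(obs)                  # e.g. desired finger velocities
    v_task = np.concatenate([v_arm, v_hand])   # shared task-space command
    return qp_controller.solve(v_task)         # joint velocities under hard limits
```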

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [abstract and §4 (experiments)] The central claim of zero-shot steerability (abstract and §4) rests on the QP controller remaining feasible and real-time solvable when safety margins are tightened or dynamic obstacles are added at runtime. However, the RL agents are trained exclusively under the nominal constraint set; no regularization, adversarial training, or post-training analysis is described that ensures the QP feasible set remains non-empty under operator-induced changes. This is load-bearing for the steerability result.

    Authors: We acknowledge that the manuscript does not provide an explicit post-training analysis of QP feasibility under modified constraints. The QP controller is formulated as a prioritized optimization problem that minimizes task-space tracking error subject to hard kinematic and collision constraints; in practice this remains feasible for moderate runtime changes because the high-level RL policy produces commands that are typically well inside the nominal feasible set. To strengthen the claim, we will add a new subsection in §4 with both theoretical conditions for feasibility (based on the null-space projection and slack prioritization) and empirical verification by re-running the QP solver offline on logged trajectories with tightened margins and injected obstacles. revision: yes
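
The offline audit promised above is cheap to operationalize. A minimal sketch, assuming per-step QP inputs were logged during the trials and reusing the hypothetical solve_joint_velocities helper from the QP sketch earlier in this review (the log format is illustrative):

```python
# Illustrative offline feasibility audit: replay logged control steps
# through the QP with tightened margins and scaled velocity limits,
# counting the fraction of steps where the solver reports infeasibility.
# `log` is an assumed list of dicts holding each step's QP inputs.
def audit_feasibility(log, margin=0.10, vel_scale=0.4):
    infeasible = 0
    for step in log:
        qd = solve_joint_velocities(
            step["J"], step["v_des"],
            vel_scale * step["qd_min"], vel_scale * step["qd_max"],
            J_p=step["J_p"], n_hat=step["n_hat"], dist=step["dist"],
            margin=margin)
        if qd is None:              # solver did not return a solution
            infeasible += 1
    return infeasible / max(len(log), 1)
```

A near-zero infeasibility rate would substantiate the steerability claim; any nonzero rate would bound how far the margins can be tightened at runtime.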

  2. Referee: [§5 and abstract] §5 (or equivalent results section) and the abstract assert successful sim-to-real transfer and disturbance recovery, yet no quantitative metrics (success rates, recovery times, failure rates), ablation studies on the hierarchical RL+QP split, or baseline comparisons are referenced. Without these, the robustness claims cannot be evaluated and the soundness of the sim-to-real pipeline remains unverified.

    Authors: We agree that the current results section would benefit from more detailed quantitative reporting. We will expand §5 with tables reporting success rates (over at least 50 trials per object category), mean recovery times from external disturbances, and failure-mode breakdowns. We will also add ablation experiments that isolate the contribution of the multi-agent RL planner versus the QP layer, as well as comparisons against an end-to-end RL baseline and a pure model-based QP tracker. These additions will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the claims rest on empirical validation of the decoupled RL-QP architecture.

full rationale

The paper presents a hybrid hierarchical framework with multi-agent RL generating task-space velocities that are then mapped by a QP controller enforcing constraints. The central claim of zero-shot steerability via runtime safety-margin adjustment is asserted as a consequence of the structural decoupling and is supported by sim-to-real hardware experiments on unseen objects and disturbances. No equations, fitted parameters, or self-citations are shown that reduce this claim to the training inputs by construction; the RL training occurs under nominal constraints while steerability is treated as an emergent property verified externally. The derivation chain is therefore self-contained as an empirical architecture proposal rather than a self-referential definition or renamed known result.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework depends on standard RL training assumptions and the existence of feasible QP solutions at every timestep; no new physical entities are postulated.

free parameters (2)
  • RL reward weights and learning rates
    Standard hyperparameters that must be tuned to achieve the reported convergence and transfer.
  • QP objective weights
    Trade-off parameters between velocity tracking, limit satisfaction, and collision avoidance that are chosen to make the controller work.
axioms (2)
  • domain assumption: The physics simulator accurately reproduces real contact and friction behavior for the tested objects.
    Required for the claimed zero-shot sim-to-real transfer to hold.
  • domain assumption: The QP always admits a feasible solution within the real-time budget.
    Implicit in the claim that the controller strictly enforces safety.

pith-pipeline@v0.9.0 · 5510 in / 1263 out tokens · 49990 ms · 2026-05-07T15:52:32.276822+00:00 · methodology

discussion (0)

