pith. machine review for the scientific record.

arxiv: 2409.01652 · v2 · submitted 2024-09-03 · 💻 cs.RO · cs.AI · cs.CV

Recognition: 2 theorem links · Lean Theorem

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords robotic manipulation · keypoint constraints · vision-language models · real-time optimization · language instructions · SE(3) trajectories · hierarchical planning

The pith

Manipulation tasks are solved in real time by optimizing sequences of relational keypoint constraints generated automatically from language instructions and RGB-D observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that robotic manipulation tasks can be encoded as sequences of Relational Keypoint Constraints, each a Python function that maps a set of 3D keypoints to a numerical cost value. A hierarchical optimization procedure then solves these constraints to produce a sequence of end-effector poses in SE(3) that a robot can execute inside a real-time perception-action loop. To remove the need for hand-written constraints on every new task, large vision models and vision-language models are used to generate the Python functions directly from free-form language instructions together with RGB-D observations. The resulting system runs on both a wheeled single-arm platform and a stationary dual-arm platform and handles multi-stage, in-the-wild, bimanual, and reactive behaviors without any task-specific training data or pre-built environment models.
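To make the representation concrete, here is a minimal, hypothetical sketch of what a pair of ReKep functions could look like for a pouring task. Only the signature comes from the paper (an array of 3D keypoints in, a scalar cost out, with the constraint satisfied when the cost is at or below zero); the keypoint indices, offsets, and task details are illustrative assumptions.

    import numpy as np

    # Hypothetical ReKep-style constraints for one stage of a pouring task.
    # The indices and offsets below are invented for illustration and are
    # not taken from the paper's generated code.

    def stage_subgoal_constraint(keypoints: np.ndarray) -> float:
        """Zero when the spout keypoint (index 3) sits 10 cm directly
        above the cup-opening keypoint (index 7)."""
        target = keypoints[7] + np.array([0.0, 0.0, 0.10])
        return float(np.linalg.norm(keypoints[3] - target))

    def stage_path_constraint(keypoints: np.ndarray) -> float:
        """Satisfied (<= 0) while the spout (index 3) stays above the
        handle keypoint (index 0), so the container is not tilted in transit."""
        return float(keypoints[0][2] - keypoints[3][2])

Because each constraint is an ordinary numerical function, an off-the-shelf solver can evaluate it cheaply on many candidate poses, which is part of what makes the real-time claim plausible.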

Core claim

ReKep represents each constraint as a Python function that takes 3D keypoints extracted from the environment and returns a scalar cost. A sequence of such functions defines a complete task, which is solved by hierarchical optimization over end-effector trajectories in SE(3); the functions themselves are produced automatically by vision-language models from language instructions and RGB-D input, enabling real-time closed-loop control across diverse manipulation scenarios.

What carries the argument

Relational Keypoint Constraints (ReKep), Python functions that map sets of 3D keypoints to numerical costs and are solved hierarchically to yield end-effector poses.
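Below is a minimal sketch of how such a two-level solve could be assembled from off-the-shelf pieces. The split into a sub-goal solve and a path solve follows the paper's description; the rotation-vector pose parameterization, the hypothetical forward_keypoints rigid-transform helper, the solver choices, and the smoothness weight are all assumptions made for illustration.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.spatial.transform import Rotation

    def pose_to_matrix(x: np.ndarray) -> np.ndarray:
        """x = [tx, ty, tz, rx, ry, rz]; rotation-vector parameterization."""
        T = np.eye(4)
        T[:3, :3] = Rotation.from_rotvec(x[3:]).as_matrix()
        T[:3, 3] = x[:3]
        return T

    def solve_subgoal(constraints, forward_keypoints, x0):
        """Level 1: find the end-of-stage end-effector pose minimizing
        the hinged sum of sub-goal constraint costs."""
        def total_cost(x):
            kps = forward_keypoints(pose_to_matrix(x))
            return sum(max(c(kps), 0.0) for c in constraints)
        return minimize(total_cost, x0, method="Nelder-Mead").x

    def solve_path(path_constraints, forward_keypoints, x_start, x_goal, n=10):
        """Level 2: refine interpolated waypoints so path constraints hold
        along the way, with a small smoothness penalty."""
        seed = np.linspace(x_start, x_goal, n)
        def total_cost(flat):
            pts = flat.reshape(n, 6)
            violation = sum(max(c(forward_keypoints(pose_to_matrix(p))), 0.0)
                            for p in pts for c in path_constraints)
            return violation + 0.01 * float(np.sum(np.diff(pts, axis=0) ** 2))
        return minimize(total_cost, seed.ravel(),
                        method="Powell").x.reshape(n, 6)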

If this is right

  • Robot actions are computed as sequences of end-effector poses in SE(3) at real-time frequencies inside a perception-action loop (sketched after this list).
  • The approach supports multi-stage, in-the-wild, bimanual, and reactive manipulation behaviors.
  • No task-specific training data or environment models are required for new tasks.
  • Constraints are generated on the fly from free-form language and RGB-D observations.
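The receding-horizon outer loop those bullets imply can be sketched as follows, assuming hypothetical track_keypoints, solve_stage, and send_pose interfaces and an invented satisfaction threshold:

    def run_task(stages, track_keypoints, solve_stage, send_pose, tol=1e-2):
        """Advance through (subgoal_fns, path_fns) stages, re-solving
        against fresh RGB-D keypoints on every tick."""
        for subgoal_fns, path_fns in stages:
            while True:
                kps = track_keypoints()              # fresh 3D keypoints
                if all(f(kps) <= tol for f in subgoal_fns):
                    break                            # stage satisfied; advance
                plan = solve_stage(kps, subgoal_fns, path_fns)
                send_pose(plan[0])                   # execute first waypoint only

Executing only the first waypoint and re-solving on the next tick is what would give the system its reactive, closed-loop character.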

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If vision-language models become more reliable at producing stable constraints, the method could scale to longer-horizon tasks that currently require manual decomposition.
  • The same keypoint-based cost functions might be reused across different robot embodiments by simply changing the SE(3) optimization targets.
  • Iterative refinement loops that feed execution failures back to the vision-language model could reduce the impact of occasional incorrect constraint generation.

Load-bearing premise

Vision-language models will produce correct, complete, and numerically stable Python constraint functions for arbitrary new tasks and scenes.

What would settle it

A demonstration on a novel scene and task in which the generated ReKep functions cause the optimizer to fail to converge, produce colliding trajectories, or drive unsafe actions that violate the intended goal.

read the original abstract

Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models. Website at https://rekep-robot.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Relational Keypoint Constraints (ReKep) as Python functions that map sets of 3D keypoints to scalar costs. It claims that representing manipulation tasks as sequences of such constraints enables a hierarchical optimization procedure to produce real-time sequences of SE(3) end-effector poses, and that large vision and vision-language models can automatically generate the required ReKep functions from free-form language instructions and RGB-D observations. Physical system demonstrations on a wheeled single-arm platform and a stationary dual-arm platform are presented for multi-stage, bimanual, in-the-wild, and reactive tasks without task-specific data or environment models.

Significance. If the VLM-generated constraints prove reliable, the work would provide a practical route to versatile, label-free manipulation by composing off-the-shelf perception models with standard optimization solvers, achieving real-time closed-loop control on two distinct physical platforms. The hierarchical formulation and perception-action loop are technically coherent, but the absence of quantitative metrics, ablations, or bounded-error analysis on the generation step limits the strength of the central claim.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): no quantitative success rates, timing statistics, ablation studies, or failure-case analysis are reported for the hierarchical optimizer or the VLM-generated constraints, despite these being required to substantiate real-time convergence and reliability across the claimed task variety.
  2. [§3.3] §3.3 (Automated ReKep Generation): the procedure that prompts VLMs to emit Python constraint functions contains no verification step, numerical stability checks, or empirical evaluation of error modes (incorrect keypoint indexing, non-differentiable operations, or incomplete temporal sequencing), which directly undermines the claim that manual specification can be circumvented for arbitrary tasks.
minor comments (2)
  1. [§3.1] Notation for the keypoint set and cost functions is introduced without a compact mathematical definition before the Python implementation; a short formalization would improve clarity.
  2. [Abstract] The website link is given but no supplementary video timestamps or failure examples are referenced in the text, making it harder for readers to locate the supporting demonstrations.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the quantitative support and evaluation of the automated generation process.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): no quantitative success rates, timing statistics, ablation studies, or failure-case analysis are reported for the hierarchical optimizer or the VLM-generated constraints, despite these being required to substantiate real-time convergence and reliability across the claimed task variety.

    Authors: We acknowledge that the manuscript currently emphasizes qualitative demonstrations to illustrate versatility across diverse tasks. In the revision, we will expand §4 with quantitative success rates from repeated trials on representative tasks, timing statistics for the full perception-action loop and optimizer, ablation studies isolating the hierarchical components, and a dedicated failure-case analysis. These additions will directly support the claims of real-time convergence and reliability. revision: yes

  2. Referee: [§3.3] §3.3 (Automated ReKep Generation): the procedure that prompts VLMs to emit Python constraint functions contains no verification step, numerical stability checks, or empirical evaluation of error modes (incorrect keypoint indexing, non-differentiable operations, or incomplete temporal sequencing), which directly undermines the claim that manual specification can be circumvented for arbitrary tasks.

    Authors: We agree that additional safeguards and empirical evaluation are warranted. The revised §3.3 will include a verification step that invokes a Python interpreter to detect syntax errors and basic numerical instabilities (such as division by zero or non-differentiable operations). We will also add an empirical breakdown of observed error modes across tested tasks, including incorrect keypoint indexing and incomplete temporal sequencing, together with the mitigation strategies employed in the current implementation. revision: yes
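For readers weighing this response, here is a minimal sketch of the interpreter-based check the authors describe: parse the generated source, load it, and probe it on random keypoint sets for non-finite costs. The function discovery, sampling range, and trial count are assumptions, and a production system would additionally sandbox the exec call.

    import ast
    import types
    import numpy as np

    def verify_constraint(source: str, n_keypoints: int, trials: int = 100):
        """Reject generated constraint code that fails to parse, defines no
        function, or returns NaN/Inf on sampled keypoints."""
        ast.parse(source)                    # raises SyntaxError if malformed
        ns = {"np": np}
        exec(source, ns)                     # load the generated function
        fns = [v for v in ns.values() if isinstance(v, types.FunctionType)]
        if not fns:
            raise ValueError("generated source defines no function")
        fn = fns[0]
        for _ in range(trials):
            kps = np.random.uniform(-1.0, 1.0, size=(n_keypoints, 3))
            cost = float(fn(kps))
            if not np.isfinite(cost):
                raise ValueError(f"non-finite cost {cost!r} on sampled input")
        return fn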

standing simulated objections (unresolved)
  • A formal bounded-error analysis of the VLM-generated constraints is not feasible in this work, as it would require theoretical guarantees on large vision-language models that are currently unavailable.

Circularity Check

0 steps flagged

No circularity: system relies on external VLMs and standard solvers

full rationale

The paper defines ReKep as Python functions from 3D keypoints to costs, then uses a hierarchical optimizer on SE(3) poses and delegates generation of those functions to off-the-shelf large vision and vision-language models. No equations or procedures inside the paper reduce by construction to fitted parameters, self-citations, or renamed inputs; the central claims rest on the external models' capabilities and the optimizer's standard behavior rather than any internal derivation that loops back to the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that off-the-shelf vision-language models can produce executable, optimizable constraint functions and that hierarchical optimization over SE(3) poses will converge in real time for the generated costs; no free parameters are explicitly fitted inside the paper, and no new physical entities are postulated.

axioms (2)
  • domain assumption Large vision and language models can map free-form language and RGB-D observations to correct Python constraint functions without task-specific fine-tuning.
    Invoked in the automated procedure section of the abstract.
  • domain assumption Hierarchical optimization of sequences of keypoint costs produces feasible real-time robot trajectories in SE(3).
    Central to the perception-action loop claim.

pith-pipeline@v0.9.0 · 5581 in / 1542 out tokens · 26476 ms · 2026-05-16T08:21:03.984396+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  2. PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.

  3. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  4. ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

    cs.RO 2026-02 unverdicted novelty 7.0

    ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

  5. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  6. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  7. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  8. BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances

    cs.RO 2026-04 unverdicted novelty 6.0

    BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.

  9. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  10. AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

    cs.RO 2026-04 unverdicted novelty 6.0

    AssemLM uses a specialized point cloud encoder inside a multimodal LLM to reach state-of-the-art 6D pose prediction for assembly tasks, backed by a new 900K-sample benchmark called AssemBench.

  11. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  12. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  13. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  14. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  15. Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    Forecast-GS predicts task-completed 3D states via Gaussian splatting to achieve higher success rates than baselines in real-world language-conditioned manipulation tasks.

  16. BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...

  17. Synergizing Efficiency and Reliability for Continuous Mobile Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    A framework integrates anticipatory planning and real-time feedback via reliability-aware optimization and phase switching to achieve efficient, reliable continuous mobile manipulation under uncertainty.

  18. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · cited by 18 Pith papers · 18 internal anchors
