pith. machine review for the scientific record.

arxiv: 2409.01652 · v2 · submitted 2024-09-03 · 💻 cs.RO · cs.AI · cs.CV

Recognition: 2 theorem links · Lean Theorem

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords robotic manipulation · keypoint constraints · vision-language models · real-time optimization · language instructions · SE(3) trajectories · hierarchical planning

The pith

Manipulation tasks are solved in real time by optimizing sequences of relational keypoint constraints generated automatically from language instructions and RGB-D observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that robotic manipulation tasks can be encoded as sequences of Relational Keypoint Constraints, each a Python function that maps a set of 3D keypoints to a numerical cost value. A hierarchical optimization procedure then solves these constraints to produce a sequence of end-effector poses in SE(3) that a robot can execute inside a real-time perception-action loop. To remove the need for hand-written constraints on every new task, large vision models and vision-language models are used to generate the Python functions directly from free-form language instructions together with RGB-D observations. The resulting system runs on both a wheeled single-arm platform and a stationary dual-arm platform and handles multi-stage, in-the-wild, bimanual, and reactive behaviors without any task-specific training data or pre-built environment models.
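To make the representation concrete, here is a minimal, hypothetical sketch of what a pair of ReKep functions could look like for a pouring task. Only the signature comes from the paper (an array of 3D keypoints in, a scalar cost out, with the constraint satisfied when the cost is at or below zero); the keypoint indices, offsets, and task details are illustrative assumptions.

    import numpy as np

    # Hypothetical ReKep-style constraints for one stage of a pouring task.
    # The indices and offsets below are invented for illustration and are
    # not taken from the paper's generated code.

    def stage_subgoal_constraint(keypoints: np.ndarray) -> float:
        """Zero when the spout keypoint (index 3) sits 10 cm directly
        above the cup-opening keypoint (index 7)."""
        target = keypoints[7] + np.array([0.0, 0.0, 0.10])
        return float(np.linalg.norm(keypoints[3] - target))

    def stage_path_constraint(keypoints: np.ndarray) -> float:
        """Satisfied (<= 0) while the spout (index 3) stays above the
        handle keypoint (index 0), so the container is not tilted in transit."""
        return float(keypoints[0][2] - keypoints[3][2])

Because each constraint is an ordinary numerical function, an off-the-shelf solver can evaluate it cheaply on many candidate poses, which is part of what makes the real-time claim plausible.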

Core claim

ReKep represents each constraint as a Python function that takes 3D keypoints extracted from the environment and returns a scalar cost. A sequence of such functions defines a complete task, which is solved by hierarchical optimization over end-effector trajectories in SE(3); the functions themselves are produced automatically by vision-language models from language instructions and RGB-D input, enabling real-time closed-loop control across diverse manipulation scenarios.

What carries the argument

Relational Keypoint Constraints (ReKep), Python functions that map sets of 3D keypoints to numerical costs and are solved hierarchically to yield end-effector poses.
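Below is a minimal sketch of how such a two-level solve could be assembled from off-the-shelf pieces. The split into a sub-goal solve and a path solve follows the paper's description; the rotation-vector pose parameterization, the hypothetical forward_keypoints rigid-transform helper, the solver choices, and the smoothness weight are all assumptions made for illustration.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.spatial.transform import Rotation

    def pose_to_matrix(x: np.ndarray) -> np.ndarray:
        """x = [tx, ty, tz, rx, ry, rz]; rotation-vector parameterization."""
        T = np.eye(4)
        T[:3, :3] = Rotation.from_rotvec(x[3:]).as_matrix()
        T[:3, 3] = x[:3]
        return T

    def solve_subgoal(constraints, forward_keypoints, x0):
        """Level 1: find the end-of-stage end-effector pose minimizing
        the hinged sum of sub-goal constraint costs."""
        def total_cost(x):
            kps = forward_keypoints(pose_to_matrix(x))
            return sum(max(c(kps), 0.0) for c in constraints)
        return minimize(total_cost, x0, method="Nelder-Mead").x

    def solve_path(path_constraints, forward_keypoints, x_start, x_goal, n=10):
        """Level 2: refine interpolated waypoints so path constraints hold
        along the way, with a small smoothness penalty."""
        seed = np.linspace(x_start, x_goal, n)
        def total_cost(flat):
            pts = flat.reshape(n, 6)
            violation = sum(max(c(forward_keypoints(pose_to_matrix(p))), 0.0)
                            for p in pts for c in path_constraints)
            return violation + 0.01 * float(np.sum(np.diff(pts, axis=0) ** 2))
        return minimize(total_cost, seed.ravel(),
                        method="Powell").x.reshape(n, 6)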

If this is right

  • Robot actions are computed as sequences of end-effector poses in SE(3) at real-time frequencies inside a perception-action loop (sketched after this list).
  • The approach supports multi-stage, in-the-wild, bimanual, and reactive manipulation behaviors.
  • No task-specific training data or environment models are required for new tasks.
  • Constraints are generated on the fly from free-form language and RGB-D observations.
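The receding-horizon outer loop those bullets imply can be sketched as follows, assuming hypothetical track_keypoints, solve_stage, and send_pose interfaces and an invented satisfaction threshold:

    def run_task(stages, track_keypoints, solve_stage, send_pose, tol=1e-2):
        """Advance through (subgoal_fns, path_fns) stages, re-solving
        against fresh RGB-D keypoints on every tick."""
        for subgoal_fns, path_fns in stages:
            while True:
                kps = track_keypoints()              # fresh 3D keypoints
                if all(f(kps) <= tol for f in subgoal_fns):
                    break                            # stage satisfied; advance
                plan = solve_stage(kps, subgoal_fns, path_fns)
                send_pose(plan[0])                   # execute first waypoint only

Executing only the first waypoint and re-solving on the next tick is what would give the system its reactive, closed-loop character.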

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If vision-language models become more reliable at producing stable constraints, the method could scale to longer-horizon tasks that currently require manual decomposition.
  • The same keypoint-based cost functions might be reused across different robot embodiments by simply changing the SE(3) optimization targets.
  • Iterative refinement loops that feed execution failures back to the vision-language model could reduce the impact of occasional incorrect constraint generation.

Load-bearing premise

Vision-language models will produce correct, complete, and numerically stable Python constraint functions for arbitrary new tasks and scenes.

What would settle it

A demonstration on a novel scene and task in which the generated ReKep functions cause the optimizer to fail to converge, produce colliding trajectories, or drive unsafe actions that violate the intended goal.

read the original abstract

Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models. Website at https://rekep-robot.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Relational Keypoint Constraints (ReKep) as Python functions that map sets of 3D keypoints to scalar costs. It claims that representing manipulation tasks as sequences of such constraints enables a hierarchical optimization procedure to produce real-time sequences of SE(3) end-effector poses, and that large vision and vision-language models can automatically generate the required ReKep functions from free-form language instructions and RGB-D observations. Physical system demonstrations on a wheeled single-arm platform and a stationary dual-arm platform are presented for multi-stage, bimanual, in-the-wild, and reactive tasks without task-specific data or environment models.

Significance. If the VLM-generated constraints prove reliable, the work would provide a practical route to versatile, label-free manipulation by composing off-the-shelf perception models with standard optimization solvers, achieving real-time closed-loop control on two distinct physical platforms. The hierarchical formulation and perception-action loop are technically coherent, but the absence of quantitative metrics, ablations, or bounded-error analysis on the generation step limits the strength of the central claim.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): no quantitative success rates, timing statistics, ablation studies, or failure-case analysis are reported for the hierarchical optimizer or the VLM-generated constraints, despite these being required to substantiate real-time convergence and reliability across the claimed task variety.
  2. [§3.3] §3.3 (Automated ReKep Generation): the procedure that prompts VLMs to emit Python constraint functions contains no verification step, numerical stability checks, or empirical evaluation of error modes (incorrect keypoint indexing, non-differentiable operations, or incomplete temporal sequencing), which directly undermines the claim that manual specification can be circumvented for arbitrary tasks.
minor comments (2)
  1. [§3.1] Notation for the keypoint set and cost functions is introduced without a compact mathematical definition before the Python implementation; a short formalization would improve clarity.
  2. [Abstract] The website link is given but no supplementary video timestamps or failure examples are referenced in the text, making it harder for readers to locate the supporting demonstrations.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the quantitative support and evaluation of the automated generation process.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): no quantitative success rates, timing statistics, ablation studies, or failure-case analysis are reported for the hierarchical optimizer or the VLM-generated constraints, despite these being required to substantiate real-time convergence and reliability across the claimed task variety.

    Authors: We acknowledge that the manuscript currently emphasizes qualitative demonstrations to illustrate versatility across diverse tasks. In the revision, we will expand §4 with quantitative success rates from repeated trials on representative tasks, timing statistics for the full perception-action loop and optimizer, ablation studies isolating the hierarchical components, and a dedicated failure-case analysis. These additions will directly support the claims of real-time convergence and reliability. revision: yes

  2. Referee: [§3.3] §3.3 (Automated ReKep Generation): the procedure that prompts VLMs to emit Python constraint functions contains no verification step, numerical stability checks, or empirical evaluation of error modes (incorrect keypoint indexing, non-differentiable operations, or incomplete temporal sequencing), which directly undermines the claim that manual specification can be circumvented for arbitrary tasks.

    Authors: We agree that additional safeguards and empirical evaluation are warranted. The revised §3.3 will include a verification step that invokes a Python interpreter to detect syntax errors and basic numerical instabilities (such as division by zero or non-differentiable operations). We will also add an empirical breakdown of observed error modes across tested tasks, including incorrect keypoint indexing and incomplete temporal sequencing, together with the mitigation strategies employed in the current implementation. revision: yes
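For readers weighing this response, here is a minimal sketch of the interpreter-based check the authors describe: parse the generated source, load it, and probe it on random keypoint sets for non-finite costs. The function discovery, sampling range, and trial count are assumptions, and a production system would additionally sandbox the exec call.

    import ast
    import types
    import numpy as np

    def verify_constraint(source: str, n_keypoints: int, trials: int = 100):
        """Reject generated constraint code that fails to parse, defines no
        function, or returns NaN/Inf on sampled keypoints."""
        ast.parse(source)                    # raises SyntaxError if malformed
        ns = {"np": np}
        exec(source, ns)                     # load the generated function
        fns = [v for v in ns.values() if isinstance(v, types.FunctionType)]
        if not fns:
            raise ValueError("generated source defines no function")
        fn = fns[0]
        for _ in range(trials):
            kps = np.random.uniform(-1.0, 1.0, size=(n_keypoints, 3))
            cost = float(fn(kps))
            if not np.isfinite(cost):
                raise ValueError(f"non-finite cost {cost!r} on sampled input")
        return fn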

standing simulated objections (unresolved)
  • A formal bounded-error analysis of the VLM-generated constraints is not feasible in this work, as it would require theoretical guarantees on large vision-language models that are currently unavailable.

Circularity Check

0 steps flagged

No circularity: system relies on external VLMs and standard solvers

full rationale

The paper defines ReKep as Python functions from 3D keypoints to costs, then uses a hierarchical optimizer on SE(3) poses and delegates generation of those functions to off-the-shelf large vision and vision-language models. No equations or procedures inside the paper reduce by construction to fitted parameters, self-citations, or renamed inputs; the central claims rest on the external models' capabilities and the optimizer's standard behavior rather than any internal derivation that loops back to the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that off-the-shelf vision-language models can produce executable, optimizable constraint functions and that hierarchical optimization over SE(3) poses will converge in real time for the generated costs; no free parameters are explicitly fitted inside the paper, and no new physical entities are postulated.

axioms (2)
  • domain assumption Large vision and language models can map free-form language and RGB-D observations to correct Python constraint functions without task-specific fine-tuning.
    Invoked in the automated procedure section of the abstract.
  • domain assumption Hierarchical optimization of sequences of keypoint costs produces feasible real-time robot trajectories in SE(3).
    Central to the perception-action loop claim.

pith-pipeline@v0.9.0 · 5581 in / 1542 out tokens · 26476 ms · 2026-05-16T08:21:03.984396+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  2. PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.

  3. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  4. ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

    cs.RO 2026-02 unverdicted novelty 7.0

    ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

  5. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  6. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  7. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  8. BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances

    cs.RO 2026-04 unverdicted novelty 6.0

    BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.

  9. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  10. AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

    cs.RO 2026-04 unverdicted novelty 6.0

    AssemLM uses a specialized point cloud encoder inside a multimodal LLM to reach state-of-the-art 6D pose prediction for assembly tasks, backed by a new 900K-sample benchmark called AssemBench.

  11. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  12. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  13. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  14. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  15. Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    Forecast-GS predicts task-completed 3D states via Gaussian splatting to achieve higher success rates than baselines in real-world language-conditioned manipulation tasks.

  16. BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...

  17. Synergizing Efficiency and Reliability for Continuous Mobile Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    A framework integrates anticipatory planning and real-time feedback via reliability-aware optimization and phase switching to achieve efficient, reliable continuous mobile manipulation under uncertainty.

  18. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · cited by 18 Pith papers · 18 internal anchors
