{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:AU4OE26SURC4YOGHCIKZG2RO6K","short_pith_number":"pith:AU4OE26S","schema_version":"1.0","canonical_sha256":"0538e26bd2a445cc38c71215936a2ef29a4cec6355412a7391ab2d9a77754a15","source":{"kind":"arxiv","id":"2409.01652","version":2},"attestation_state":"computed","paper":{"title":"ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Manipulation tasks are solved in real time by optimizing sequences of relational keypoint constraints generated automatically from language instructions and RGB-D observations.","cross_cats":["cs.AI","cs.CV"],"primary_cat":"cs.RO","authors_text":"Chen Wang, Li Fei-Fei, Ruohan Zhang, Wenlong Huang, Yunzhu Li","submitted_at":"2024-09-03T06:45:22Z","abstract_excerpt":"Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. 
Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoin"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2409.01652","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.RO","submitted_at":"2024-09-03T06:45:22Z","cross_cats_sorted":["cs.AI","cs.CV"],"title_canon_sha256":"93a498b7c81204f6e0578dac255065e9366313125d4cb49b8e15fb629cfd4561","abstract_canon_sha256":"66448a9fef71bb9aeed2a6f013fc0254372fead7269858ee59f22b2508bdd824"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.512067Z","signature_b64":"Rrkm57LYiIJKGsNrv9LCCFlF30BMYBR5kAKETxqSnldMjXsPx3jqiaU256sK8yBxRH1i7g5B7vFlwc1irrxCBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0538e26bd2a445cc38c71215936a2ef29a4cec6355412a7391ab2d9a77754a15","last_reissued_at":"2026-05-17T23:38:48.511579Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.511579Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Manipulation tasks are solved in real time by optimizing sequences of relational keypoint constraints generated automatically from language instructions and RGB-D observations.","cross_cats":["cs.AI","cs.CV"],"primary_cat":"cs.RO","authors_text":"Chen Wang, Li Fei-Fei, Ruohan Zhang, Wenlong Huang, Yunzhu 
Li","submitted_at":"2024-09-03T06:45:22Z","abstract_excerpt":"Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. 
Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The vision-language models will reliably generate correct, complete, and numerically stable Python constraint functions for arbitrary new tasks and scenes without introducing errors that break the optimizer or produce unsafe actions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Manipulation tasks are solved in real time by optimizing sequences of relational keypoint constraints generated automatically from language instructions and RGB-D observations.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"61b32d4da4c59649100ce2faeeaacf3e422ed4b3daf49b98fa22034105daa9a8"},"source":{"id":"2409.01652","kind":"arxiv","version":2},"verdict":{"id":"95f3c8eb-2769-4d38-9e0a-61e7a99990e1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:21:03.984396Z","strongest_claim":"by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a 
perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations.","one_line_summary":"ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The vision-language models will reliably generate correct, complete, and numerically stable Python constraint functions for arbitrary new tasks and scenes without introducing errors that break the optimizer or produce unsafe actions.","pith_extraction_headline":"Manipulation tasks are solved in real time by optimizing sequences of relational keypoint constraints generated automatically from language instructions and RGB-D observations."},"references":{"count":158,"sample":[{"doi":"","year":2010,"title":"L. P. Kaelbling and T. Lozano-Pérez. Hierarchical planning in the now. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010","work_id":"a6578317-bf75-45a7-9e33-ce65244b424b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"D. Driess, J.-S. Ha, M. Toussaint, and R. Tedrake. Learning models as functionals of signed-distance fields for manipulation planning. In Conference on robot learning, pages 245–255. PMLR, 2022","work_id":"d53daa3a-8552-41d9-a445-e9dc89c20675","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"A. Simeonov, Y. Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V. Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. 
In 2022 Inter","work_id":"15655890-22dd-459b-860e-a12d2e639a40","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"L. Manuelli, W. Gao, P. Florence, and R. Tedrake. kpam: Keypoint affordances for category-level robotic manipulation. In The International Symposium of Robotics Research, pages 132–157. Springer, 20","work_id":"960540bb-ce99-42de-b544-f33a9367ff1a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","ref_index":5,"cited_arxiv_id":"2304.07193","is_internal_anchor":true}],"resolved_work":158,"snapshot_sha256":"14e449c0b341ad545928328a4e7bde2ade54eb517813687e111d0c53d8a18148","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"9654112c60ac126aa23fe782063d46b9f9fbb167f6a2c3d6d88ba6f94a53f4cf"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.01652","created_at":"2026-05-17T23:38:48.511669+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.01652v2","created_at":"2026-05-17T23:38:48.511669+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.01652","created_at":"2026-05-17T23:38:48.511669+00:00"},{"alias_kind":"pith_short_12","alias_value":"AU4OE26SURC4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"AU4OE26SURC4YOGH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"AU4OE26S","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2605.23856","citing_title":"Point Tracking Improves World Action 
Models","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2405.14093","citing_title":"A Survey on Vision-Language-Action Models for Embodied AI","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22183","citing_title":"Action with Visual Primitives","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19102","citing_title":"FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2509.14787","citing_title":"COMPASS: Confined-space Manipulation Planning with Active Sensing Strategy","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2507.00990","citing_title":"Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2508.13998","citing_title":"Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2512.01773","citing_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2601.07060","citing_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08392","citing_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2503.22020","citing_title":"CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2503.10631","citing_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14274","citing_title":"CreFlow: 
Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2510.13778","citing_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04974","citing_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11951","citing_title":"From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11144","citing_title":"Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10307","citing_title":"PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23249","citing_title":"BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05714","citing_title":"TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01448","citing_title":"Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21241","citing_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2501.09747","citing_title":"FAST: Efficient Action Tokenization for Vision-Language-Action 
Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07306","citing_title":"BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08983","citing_title":"AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly","ref_index":20,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K","json":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K.json","graph_json":"https://pith.science/api/pith-number/AU4OE26SURC4YOGHCIKZG2RO6K/graph.json","events_json":"https://pith.science/api/pith-number/AU4OE26SURC4YOGHCIKZG2RO6K/events.json","paper":"https://pith.science/paper/AU4OE26S"},"agent_actions":{"view_html":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K","download_json":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K.json","view_paper":"https://pith.science/paper/AU4OE26S","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.01652&json=true","fetch_graph":"https://pith.science/api/pith-number/AU4OE26SURC4YOGHCIKZG2RO6K/graph.json","fetch_events":"https://pith.science/api/pith-number/AU4OE26SURC4YOGHCIKZG2RO6K/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K/action/timestamp_anchor","attest_storage":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K/action/storage_attestation","attest_author":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K/action/author_attestation","sign_citation":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K/action/citation_signature","submit_replication":"https://pith.science/pith/AU4OE26SURC4YOGHCIKZG2RO6K/action/replication_record"}},"created_at":"2026-05-17T23:38:48.511669+00:00","updated_at":"2026-05-17T23:38:48.511669
+00:00"}