SkillWrapper: Generative Predicate Invention for Task-level Planning
Pith reviewed 2026-05-17 05:39 UTC · model grok-4.3
The pith
A formal theory of generative predicate invention produces symbolic operators for provably sound and complete robot task planning from RGB images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a formal theory of generative predicate invention for skill abstraction, resulting in symbolic operators that can be used for provably sound and complete planning. SkillWrapper implements the theory by using foundation models to actively collect robot data and learn human-interpretable, plannable representations of black-box skills from RGB image observations alone, with empirical validation in simulation and on physical robots for long-horizon tasks.
What carries the argument
The formal theory of generative predicate invention, which defines the conditions under which generated predicates yield symbolic operators that preserve soundness and completeness for domain-independent planning.
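To make "symbolic operator" concrete, the sketch below shows one common way such an operator is represented: a STRIPS-style record of preconditions, add effects, and delete effects over predicate names. This is an illustration only. The class name `Operator`, the predicates `hand_empty`, `on_table(cup)`, and `holding(cup)`, and the `pick(cup)` skill are all hypothetical and are not taken from the paper, and here the abstract state is given directly as a set of true predicates, whereas the paper derives predicate truth values from RGB images.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Operator:
    """STRIPS-style abstraction of one black-box skill (hypothetical layout)."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]

    def applicable(self, state: FrozenSet[str]) -> bool:
        # The operator may fire only if every precondition holds in the state.
        return self.preconditions <= state

    def apply(self, state: FrozenSet[str]) -> FrozenSet[str]:
        # Predicted abstract successor: remove deleted facts, then add new ones.
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable in {sorted(state)}")
        return (state - self.delete_effects) | self.add_effects


# Illustrative operator for a made-up "pick" skill.
pick = Operator(
    name="pick(cup)",
    preconditions=frozenset({"hand_empty", "on_table(cup)"}),
    add_effects=frozenset({"holding(cup)"}),
    delete_effects=frozenset({"hand_empty", "on_table(cup)"}),
)

start = frozenset({"hand_empty", "on_table(cup)"})
after_pick = pick.apply(start)
print(sorted(after_pick))
```

In this reading, the theory's conditions concern whether predictions such as `after_pick` agree with what the skill actually does when executed.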
If this is right
- The resulting symbolic operators integrate directly with standard domain-independent planners for high-level task reasoning.
- Representations learned in simulation or from collected data enable solving long-horizon tasks that were not encountered during training.
- Planning proceeds using only RGB images even when the underlying skills remain black boxes with no exposed state.
- The same learned abstractions support both simulated training and direct real-robot deployment without additional engineering.
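The first consequence above, that learned operators plug into a domain-independent planner, can be sketched with a small breadth-first search standing in for an off-the-shelf planner. Everything named here is an assumption for illustration: the `Op` tuple, the `bfs_plan` function, and the two toy skills are hypothetical, and the paper does not state that it uses breadth-first search.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional, Sequence


class Op(NamedTuple):
    """Hypothetical learned operator: preconditions, add list, delete list."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def bfs_plan(init: FrozenSet[str], goal: FrozenSet[str],
             ops: Sequence[Op]) -> Optional[List[str]]:
    """Return a shortest sequence of operator names reaching the goal, or None."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:  # every goal predicate holds
            return plan
        for op in ops:
            if op.pre <= state:
                nxt = (state - op.delete) | op.add
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [op.name]))
    return None  # goal unreachable under these operators


# Toy operators, invented for this example.
ops = [
    Op("pick(cup)",
       frozenset({"hand_empty", "on_table(cup)"}),
       frozenset({"holding(cup)"}),
       frozenset({"hand_empty", "on_table(cup)"})),
    Op("place(cup,shelf)",
       frozenset({"holding(cup)"}),
       frozenset({"on_shelf(cup)", "hand_empty"}),
       frozenset({"holding(cup)"})),
]

init = frozenset({"hand_empty", "on_table(cup)"})
plan = bfs_plan(init, frozenset({"on_shelf(cup)"}), ops)
no_plan = bfs_plan(init, frozenset({"on_floor(cup)"}), ops)
print(plan, no_plan)
```

The planner sees only predicate names, never images or skill internals, which is what lets the skills stay black boxes.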
Where Pith is reading between the lines
- If the formal properties transfer reliably, the method could reduce reliance on manually engineered predicates across many robot domains.
- Active data collection guided by the theory might be adapted to handle partial observability or sensor noise in more complex settings.
- The predicate invention process could be tested for compatibility with other high-level planners or combined with learned low-level controllers.
Load-bearing premise
The predicates generated by the foundation model must satisfy the formal completeness and soundness conditions required by the theory, and these properties must transfer when the black-box skills run on real robots from image inputs.
What would settle it
A concrete counterexample in which a plan produced by the learned operators cannot reach the goal despite each individual skill executing correctly on the robot would falsify the claim that the operators are sound and complete.
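A search for such a counterexample could be organized roughly as follows: replay a plan through the learned operators and compare each predicted abstract state with the state observed after the corresponding skill runs. This is a minimal sketch under stated assumptions. The function `first_divergence`, the `(pre, add, delete)` model layout, and the toy traces are hypothetical, and the observed states are supplied as plain sets here rather than computed from RGB images.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# (preconditions, add effects, delete effects) for each skill name.
Effects = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]


def first_divergence(init: FrozenSet[str], plan: List[str],
                     model: Dict[str, Effects],
                     observed: List[FrozenSet[str]]) -> Optional[int]:
    """Index of the first step where prediction and observation disagree.

    Returns None when the whole trace matches the learned operators.
    """
    if len(plan) != len(observed):
        raise ValueError("need exactly one observed state per plan step")
    state = init
    for i, (skill, seen) in enumerate(zip(plan, observed)):
        pre, add, delete = model[skill]
        if not pre <= state:
            return i  # the model says the skill should not be executable here
        state = (state - delete) | add
        if state != seen:
            return i  # predicted effects differ from what was observed
    return None


# Toy model and traces, invented for this example.
model = {
    "pick(cup)": (frozenset({"hand_empty", "on_table(cup)"}),
                  frozenset({"holding(cup)"}),
                  frozenset({"hand_empty", "on_table(cup)"})),
    "place(cup,shelf)": (frozenset({"holding(cup)"}),
                         frozenset({"on_shelf(cup)", "hand_empty"}),
                         frozenset({"holding(cup)"})),
}
init = frozenset({"hand_empty", "on_table(cup)"})
plan = ["pick(cup)", "place(cup,shelf)"]

matching = [frozenset({"holding(cup)"}),
            frozenset({"on_shelf(cup)", "hand_empty"})]
diverging = [frozenset({"holding(cup)"}),
             frozenset({"on_table(cup)", "hand_empty"})]

print(first_divergence(init, plan, model, matching))
print(first_divergence(init, plan, model, diverging))
```

A non-`None` result on a trace where every skill ran correctly would be the kind of evidence described above.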
Original abstract
Generalizing from individual skill executions to solving long-horizon tasks remains a core challenge in building autonomous agents. A promising direction is learning high-level, symbolic abstractions of the low-level skills of the agents, enabling reasoning and planning independent of the low-level state space. Among possible high-level representations, object-centric skill abstraction with symbolic predicates has been proven to be efficient because of its compatibility with domain-independent planners. Recent advances in foundation models have made it possible to generate symbolic predicates that operate on raw sensory inputs, a process we call generative predicate invention, to facilitate downstream abstraction learning. However, it remains unclear which formal properties the learned representations must satisfy, and how they can be learned to guarantee these properties. In this paper, we address both questions by presenting a formal theory of generative predicate invention for skill abstraction, resulting in symbolic operators that can be used for provably sound and complete planning. Within this framework, we propose SkillWrapper, a method that leverages foundation models to actively collect robot data and learn human-interpretable, plannable representations of black-box skills, using only RGB image observations. Our extensive empirical evaluation in simulation and on real robots shows that SkillWrapper learns abstract representations that enable solving unseen, long-horizon tasks in the real world with black-box skills.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a formal theory of generative predicate invention for skill abstraction, which produces symbolic operators suitable for provably sound and complete planning. SkillWrapper is proposed as a practical method that employs foundation models to actively gather robot data from RGB observations and learn interpretable, plannable representations of black-box skills. Extensive experiments in simulation and on physical robots demonstrate the approach's ability to solve previously unseen long-horizon tasks.
Significance. Should the generated predicates reliably satisfy the formal conditions and the learned representations transfer effectively to real-world execution, this contribution would be significant. It bridges data-driven foundation models with symbolic AI planning, offering a pathway to guaranteed performance in complex robotic tasks without requiring full state observability or hand-crafted abstractions.
major comments (2)
- [§3] The formal theory claims to yield provably sound and complete planning from predicates that meet specific conditions (e.g., accurate state classification and preservation of transition semantics). However, the generative process in SkillWrapper, which relies on foundation models trained on limited trajectories, provides no enforcement or verification mechanism to ensure these conditions are met, particularly regarding completeness over the full state space or under real-robot distribution shifts.
- [§5] The empirical evaluation summarizes results at a high level without error bars, detailed baselines, or explicit exclusion criteria for successful task executions. This limits the ability to verify whether the performance gains support the central claim of enabling reliable planning for unseen tasks with black-box skills.
minor comments (2)
- [Abstract] The abstract mentions 'extensive empirical evaluation' but provides no quantitative details; consider adding key metrics or success rates to better convey the strength of the results.
- [Notation] Some notation for the invented predicates and operators could be clarified earlier in the paper to aid readers unfamiliar with the formal framework.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with revisions indicated where appropriate to improve clarity and rigor.
- Referee: [§3] The formal theory claims to yield provably sound and complete planning from predicates that meet specific conditions (e.g., accurate state classification and preservation of transition semantics). However, the generative process in SkillWrapper, which relies on foundation models trained on limited trajectories, provides no enforcement or verification mechanism to ensure these conditions are met, particularly regarding completeness over the full state space or under real-robot distribution shifts.
  Authors: We appreciate the referee's emphasis on the distinction between the formal theory and its practical realization. Section 3 presents sufficient conditions on predicates that guarantee sound and complete planning when those conditions hold; the theory itself is agnostic to the method of predicate generation. SkillWrapper is a practical, data-driven procedure that uses foundation models to propose predicates from limited RGB trajectories. We do not claim a formal enforcement or verification procedure, as exhaustive verification of completeness over the full (potentially continuous) state space is intractable and would be further complicated by distribution shifts on real robots. Instead, we rely on empirical validation across simulation and physical experiments showing successful planning on unseen long-horizon tasks. In the revised manuscript we will add a new subsection in §3 that explicitly discusses the gap between the theoretical conditions and the learned predicates, including potential failure modes under distribution shift and the role of empirical evidence in supporting the claims.
  revision: partial
- Referee: [§5] The empirical evaluation summarizes results at a high level without error bars, detailed baselines, or explicit exclusion criteria for successful task executions. This limits the ability to verify whether the performance gains support the central claim of enabling reliable planning for unseen tasks with black-box skills.
  Authors: We agree that the current empirical presentation would benefit from greater detail and transparency. In the revised version we will augment all tables and figures with error bars (standard deviation across repeated trials), expand the description of baselines and ablations with explicit implementation details, and add a dedicated paragraph specifying the success criteria and any exclusion rules used for task executions. These additions will make the performance gains more verifiable and directly support the central claim.
  revision: yes
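The error bars promised in the response to [§5] (standard deviation across repeated trials) amount to a small computation, sketched here. The function name `success_summary` and the outcome lists are hypothetical. The numbers are placeholders invented to exercise the code and are not results from the paper. The sketch assumes each repeated run attempts at least one task and uses the sample standard deviation.

```python
from statistics import mean, stdev
from typing import List, Tuple


def success_summary(runs: List[List[bool]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run success rates."""
    if len(runs) < 2:
        raise ValueError("need at least two repeated runs for a standard deviation")
    if any(len(run) == 0 for run in runs):
        raise ValueError("every run must contain at least one task outcome")
    # One success rate per repeated run (True counts as 1, False as 0).
    rates = [sum(run) / len(run) for run in runs]
    return mean(rates), stdev(rates)


# Placeholder outcomes, NOT data from the paper.
placeholder_runs = [
    [True, True, False, True],   # run 1: 3 of 4 tasks solved
    [True, False, False, True],  # run 2: 2 of 4 tasks solved
    [True, True, True, True],    # run 3: 4 of 4 tasks solved
]
avg, spread = success_summary(placeholder_runs)
print(f"success rate: {avg:.2f} +/- {spread:.2f}")
```

Reporting the pair `(avg, spread)` per method and task family, together with the stated success criteria, would let a reader judge whether differences between methods exceed run-to-run variation.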
Circularity Check
No significant circularity; formal theory and method are independent
The paper introduces a formal theory of generative predicate invention that yields symbolic operators for provably sound and complete planning, conditional on predicates satisfying stated properties such as accurate state classification and transition preservation. SkillWrapper then uses foundation models and active data collection from RGB observations to produce those predicates. No equations, self-referential definitions, or reductions appear that make the planning guarantees equivalent to fitted parameters or prior self-citations by construction. The derivation relies on external foundation models and robot data, keeping the central claims self-contained rather than circular. This matches the default expectation for papers without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generated predicates satisfy the formal properties needed for sound and complete planning
invented entities (1)
- Generative predicates invented by foundation models (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
  BISON learns bilevel policies over symbolic world models to generalize long-horizon robotic planning beyond VLA and end-to-end baselines while remaining efficient even at 10,000-object scale.