arxiv: 2511.18203 · v6 · submitted 2025-11-22 · 💻 cs.RO

SkillWrapper: Generative Predicate Invention for Task-level Planning

Pith reviewed 2026-05-17 05:39 UTC · model grok-4.3

classification 💻 cs.RO
keywords generative predicate invention · skill abstraction · symbolic operators · robot task planning · foundation models · RGB observations · black-box skills · long-horizon tasks

The pith

A formal theory of generative predicate invention produces symbolic operators for provably sound and complete robot task planning from RGB images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a formal theory of generative predicate invention that turns foundation-model outputs into symbolic operators supporting sound and complete planning over black-box skills. This matters because it lets agents reason at a high level while executing low-level actions without needing access to internal skill states or hand-designed abstractions. SkillWrapper puts the theory into practice by directing foundation models to collect interaction data and learn human-interpretable representations solely from RGB observations. If the approach holds, robots can solve previously unseen long-horizon tasks by composing learned operators into plans that remain valid when executed in the real world.

Core claim

The authors present a formal theory of generative predicate invention for skill abstraction, resulting in symbolic operators that can be used for provably sound and complete planning. SkillWrapper implements the theory by using foundation models to actively collect robot data and learn human-interpretable, plannable representations of black-box skills from RGB image observations alone, with empirical validation in simulation and on physical robots for long-horizon tasks.

What carries the argument

The formal theory of generative predicate invention: it defines the conditions under which generated predicates yield symbolic operators that preserve soundness and completeness for domain-independent planning.
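
As a rough illustration of what such a symbolic operator looks like, the following Python sketch assumes a STRIPS-style form with preconditions, add effects, and delete effects over ground atoms. The names `Operator`, `applicable`, and `apply_operator` are hypothetical, and the `Pick(Bowl)` example only reuses predicate names that appear in the figure captions (`GripperEmpty()`, `Holding(...)`); the paper's actual formalism may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical STRIPS-style operator over ground atoms (strings)."""
    name: str
    preconditions: frozenset   # atoms that must hold before execution
    add_effects: frozenset     # atoms that become true afterwards
    delete_effects: frozenset  # atoms that become false afterwards


def applicable(op: Operator, state: frozenset) -> bool:
    """An operator is applicable when all its preconditions hold in the state."""
    return op.preconditions <= state


def apply_operator(op: Operator, state: frozenset) -> frozenset:
    """Return the predicted abstract state after executing the operator."""
    if not applicable(op, state):
        raise ValueError(f"{op.name} is not applicable in {sorted(state)}")
    return (state - op.delete_effects) | op.add_effects


# Illustrative operator; not taken from the paper's learned models.
pick_bowl = Operator(
    name="Pick(Bowl)",
    preconditions=frozenset({"GripperEmpty()"}),
    add_effects=frozenset({"Holding(Bowl)"}),
    delete_effects=frozenset({"GripperEmpty()"}),
)

state = frozenset({"GripperEmpty()"})
next_state = apply_operator(pick_bowl, state)
print(sorted(next_state))
```

In this reading, the theory's conditions concern whether the atoms (predicates evaluated on images) and the operators built from them agree with what the black-box skill actually does.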

If this is right

  • The resulting symbolic operators integrate directly with standard domain-independent planners for high-level task reasoning.
  • Representations learned in simulation or from collected data enable solving long-horizon tasks that were not encountered during training.
  • Planning proceeds using only RGB images even when the underlying skills remain black boxes with no exposed state.
  • The same learned abstractions support both simulated training and direct real-robot deployment without additional engineering.
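
To illustrate the first point in the list above, here is a minimal sketch of how operators of this kind can be chained into a plan. A real system would hand the operators to an off-the-shelf domain-independent planner; the breadth-first search below is only a stand-in, and the two operators and the `On(Bowl, Plate)` atom are invented for the example.

```python
from collections import deque


def find_plan(initial, goal, operators):
    """Breadth-first search over abstract states.

    operators: list of (name, preconditions, add_effects, delete_effects),
    where the last three are frozensets of ground atoms.
    Returns a list of operator names, or None if the goal is unreachable.
    """
    start = frozenset(initial)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        for name, pre, add, delete in operators:
            if pre <= state:
                nxt = (state - delete) | add
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None


# Hypothetical operators, for illustration only.
operators = [
    ("Pick(Bowl)",
     frozenset({"GripperEmpty()"}),
     frozenset({"Holding(Bowl)"}),
     frozenset({"GripperEmpty()"})),
    ("Stack(Bowl, Plate)",
     frozenset({"Holding(Bowl)"}),
     frozenset({"On(Bowl, Plate)", "GripperEmpty()"}),
     frozenset({"Holding(Bowl)"})),
]

plan = find_plan({"GripperEmpty()"}, frozenset({"On(Bowl, Plate)"}), operators)
print(plan)
```

The planner never sees images or skill internals; it reasons only over the abstract atoms, which is why the quality of the learned operators determines whether its plans hold up in execution.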

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the formal properties transfer reliably, the method could reduce reliance on manually engineered predicates across many robot domains.
  • Active data collection guided by the theory might be adapted to handle partial observability or sensor noise in more complex settings.
  • The predicate invention process could be tested for compatibility with other high-level planners or combined with learned low-level controllers.

Load-bearing premise

The predicates generated by the foundation model must satisfy the formal completeness and soundness conditions required by the theory, and these properties must transfer when the black-box skills run on real robots from image inputs.

What would settle it

A concrete counterexample would settle it. A plan produced by the learned operators that fails to reach the goal even though each individual skill executes correctly on the robot would falsify soundness; a task that the skills can solve but for which the planner finds no plan would falsify completeness.
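
One way to look for such a counterexample is to replay a plan symbolically and compare each predicted abstract state with the abstract state observed after the corresponding skill runs. The sketch below assumes the observed states have already been produced by some image-to-predicate classifier; the function name `first_divergence` and the example data are hypothetical.

```python
def first_divergence(initial, steps, observed):
    """Find the first step where the operator model and observation disagree.

    steps: list of (name, preconditions, add_effects, delete_effects).
    observed: abstract state (set of atoms) seen after each executed step.
    Returns (step_index, operator_name, reason) or None if they always agree.
    """
    state = frozenset(initial)
    for i, ((name, pre, add, delete), seen) in enumerate(zip(steps, observed)):
        if not pre <= state:
            return i, name, "precondition not satisfied in the model"
        state = (state - delete) | add
        if state != frozenset(seen):
            return i, name, "predicted and observed states differ"
    return None


# Hypothetical plan and observations, for illustration only.
steps = [
    ("Pick(Bowl)",
     frozenset({"GripperEmpty()"}),
     frozenset({"Holding(Bowl)"}),
     frozenset({"GripperEmpty()"})),
    ("Stack(Bowl, Plate)",
     frozenset({"Holding(Bowl)"}),
     frozenset({"On(Bowl, Plate)", "GripperEmpty()"}),
     frozenset({"Holding(Bowl)"})),
]
observed = [
    {"Holding(Bowl)"},
    {"GripperEmpty()"},  # the bowl did not end up on the plate
]

result = first_divergence({"GripperEmpty()"}, steps, observed)
print(result)
```

A divergence found this way points at a specific operator whose preconditions or effects are wrong, which is more informative than a bare end-of-task failure.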

Figures

Figures reproduced from arXiv: 2511.18203 by Ahmed Jaafar, Benned Hedegaard, David Paulius, George Konidaris, Haotian Fu, Naman Shah, Shreyas S. Raman, Skye Thompson, Stefanie Tellex, Yichen Wei, Ziyi Yang.

Figure 1: Overview of SkillWrapper. For an agent equipped with black-box skills, SkillWrapper learns skill representations that are compatible with off-the-shelf planners. These representations are comprised of predicates invented by the foundation model. Given a novel planning problem described using the initial state and goal state as RGB images, a foundation model produces the corresponding abstract states by a…
Figure 2: Example of Predicate Invention. The initial states of two transitions are both said to satisfy the preconditions of certain operators learned from the same skill, while transition 1 is successful, but transition 2 is not. In this case, the first condition (precondition) is triggered, and the foundation model is prompted with both transitions to invent a new predicate. Empirical predicate selection. Althoug…
Figure 3: Robotouille environment. We first conduct experiments in Robotouille (Gonzalez-Pumariega et al., 2025), which is a simulated grid world kitchen domain with an agent that has five high-level skills: Pick, Place, Cut, Cook, and Stack. In the environment, there are several objects: a patty, lettuce, a top bun, and a bottom bun; there is also a cutting board and a stove for cutting the lettuce and cooking the …
Figure 4: Initial and Goal States for Real Robot Experiments.
Figure 5: Sequence of Bimanual Robot Skill Execution with Predicate Value Changes
Figure 6: Bimanual Kuka Scenario Results over 5 iterations with invented predicate and learned
Figure 7: Example task in Robotouille. (a) Initial state (b) Goal state
Figure 8: Example task in Franka. (a) Initial state (b) Goal state
Figure 9: Example task in Bimanual Kuka.
Figure 17: Predicate Invention Case #1 in Franka. Target predicate: GripperEmpty( Existing predicates: ∅. Transitions: (a) ✓ Stack(Bowl, Plate), (b) ✗ Stack(Bowl, Plate). GPT-5: ✓ plate top empty(?plate), ✓ plate is clean(?plate), ✓ plate is clean(?plate). Qwen3: ✗ stacked on(?pickupable, ?plate), ✗ on center of(?pickupable, ?plate), ✗ is fully supported(?pickupable, ?plate).
Figure 18: Predicate Invention Case #2 in Franka. Target predicate: PlateIsDirty(?plate). Existing predicates: GripperEmpty(), Holding(?pickupable). Transitions: (a) ✓ Scoop(Knife, Jar), (b) ✗ Scoop(Knife, Jar). GPT-5: ✓ Open(?openable), ✓ Open(?openable), ✓ Open(?openable). Qwen3: ✗ UtensilInOpenable(?utensil, ?openable), ✗ UtensilInOpening(?utensil, ?openable), ✗ UtensilInOpenable(?utensil, ?openable).
Figure 19: Predicate Invention Case #1 in Bi-Kuka. Target predicate: LidOff(?openable). Existing predicates: InLeftGripper(?openable), InRightGripper(?utensil). Transitions: (a) ✓ Open(Jar), (b) ✗ Open(Jar). GPT-5: ✓ RightHandEmpty(), ✓ RightHandEmpty(), ✗ LidAttached(?openable). Qwen3: ✗ FullyEnclosedByLeftGripper(?openable), ✗ FullyEnclosedByLeftGripper(?openable), ✗ FullyEnclosedByLeftGripper(?openable).
Figure 20: Predicate Invention Case #2 in Bi-Kuka. Target predicate: RightGripperEmpty(). Existing predicates: InLeftGripper(?openable), LidOff(?openable).
Original abstract

Generalizing from individual skill executions to solving long-horizon tasks remains a core challenge in building autonomous agents. A promising direction is learning high-level, symbolic abstractions of the low-level skills of the agents, enabling reasoning and planning independent of the low-level state space. Among possible high-level representations, object-centric skill abstraction with symbolic predicates has been proven to be efficient because of its compatibility with domain-independent planners. Recent advances in foundation models have made it possible to generate symbolic predicates that operate on raw sensory inputs, a process we call generative predicate invention, to facilitate downstream abstraction learning. However, it remains unclear which formal properties the learned representations must satisfy, and how they can be learned to guarantee these properties. In this paper, we address both questions by presenting a formal theory of generative predicate invention for skill abstraction, resulting in symbolic operators that can be used for provably sound and complete planning. Within this framework, we propose SkillWrapper, a method that leverages foundation models to actively collect robot data and learn human-interpretable, plannable representations of black-box skills, using only RGB image observations. Our extensive empirical evaluation in simulation and on real robots shows that SkillWrapper learns abstract representations that enable solving unseen, long-horizon tasks in the real world with black-box skills.
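
The Figure 2 caption describes when generative predicate invention is triggered: two transitions of the same skill both satisfy a learned operator's precondition, yet one succeeds and the other fails. A minimal sketch of that detection step is shown here, assuming transitions are stored as dictionaries with `skill`, `state`, and `success` fields (a hypothetical layout). The call to the foundation model that would propose the new predicate is deliberately left out.

```python
from itertools import combinations


def conflicting_pairs(transitions, skill, precondition):
    """Return pairs of transitions that the current operator cannot tell apart.

    Both transitions in a pair execute `skill` from an abstract state that
    satisfies `precondition`, but exactly one of them succeeded. Such a pair
    signals that the current predicates miss a distinction.
    """
    eligible = [
        t for t in transitions
        if t["skill"] == skill and precondition <= t["state"]
    ]
    return [
        (a, b) for a, b in combinations(eligible, 2)
        if a["success"] != b["success"]
    ]


# Hypothetical transition log, for illustration only.
transitions = [
    {"skill": "Stack(Bowl, Plate)", "state": frozenset({"Holding(Bowl)"}), "success": True},
    {"skill": "Stack(Bowl, Plate)", "state": frozenset({"Holding(Bowl)"}), "success": False},
    {"skill": "Pick(Bowl)", "state": frozenset({"GripperEmpty()"}), "success": True},
]

pairs = conflicting_pairs(
    transitions, "Stack(Bowl, Plate)", frozenset({"Holding(Bowl)"})
)
print(len(pairs))
```

Each returned pair is the kind of contrastive evidence the caption says is shown to the foundation model when it is prompted for a new predicate.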

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a formal theory of generative predicate invention for skill abstraction, which produces symbolic operators suitable for provably sound and complete planning. SkillWrapper is proposed as a practical method that employs foundation models to actively gather robot data from RGB observations and learn interpretable, plannable representations of black-box skills. Extensive experiments in simulation and on physical robots demonstrate the approach's ability to solve previously unseen long-horizon tasks.

Significance. Should the generated predicates reliably satisfy the formal conditions and the learned representations transfer effectively to real-world execution, this contribution would be significant. It bridges data-driven foundation models with symbolic AI planning, offering a pathway to guaranteed performance in complex robotic tasks without requiring full state observability or hand-crafted abstractions.

major comments (2)
  1. [§3] The formal theory claims to yield provably sound and complete planning from predicates that meet specific conditions (e.g., accurate state classification and preservation of transition semantics). However, the generative process in SkillWrapper, which relies on foundation models trained on limited trajectories, provides no enforcement or verification mechanism to ensure these conditions are met, particularly regarding completeness over the full state space or under real-robot distribution shifts.
  2. [§5] The empirical evaluation summarizes results at a high level without error bars, detailed baselines, or explicit exclusion criteria for successful task executions. This limits the ability to verify whether the performance gains support the central claim of enabling reliable planning for unseen tasks with black-box skills.
minor comments (2)
  1. [Abstract] The abstract mentions 'extensive empirical evaluation' but provides no quantitative details; consider adding key metrics or success rates to better convey the strength of the results.
  2. [Notation] Some notation for the invented predicates and operators could be clarified earlier in the paper to aid readers unfamiliar with the formal framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with revisions indicated where appropriate to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§3] The formal theory claims to yield provably sound and complete planning from predicates that meet specific conditions (e.g., accurate state classification and preservation of transition semantics). However, the generative process in SkillWrapper, which relies on foundation models trained on limited trajectories, provides no enforcement or verification mechanism to ensure these conditions are met, particularly regarding completeness over the full state space or under real-robot distribution shifts.

    Authors: We appreciate the referee's emphasis on the distinction between the formal theory and its practical realization. Section 3 presents sufficient conditions on predicates that guarantee sound and complete planning when those conditions hold; the theory itself is agnostic to the method of predicate generation. SkillWrapper is a practical, data-driven procedure that uses foundation models to propose predicates from limited RGB trajectories. We do not claim a formal enforcement or verification procedure, as exhaustive verification of completeness over the full (potentially continuous) state space is intractable and would be further complicated by distribution shifts on real robots. Instead, we rely on empirical validation across simulation and physical experiments showing successful planning on unseen long-horizon tasks. In the revised manuscript we will add a new subsection in §3 that explicitly discusses the gap between the theoretical conditions and the learned predicates, including potential failure modes under distribution shift and the role of empirical evidence in supporting the claims. revision: partial

  2. Referee: [§5] The empirical evaluation summarizes results at a high level without error bars, detailed baselines, or explicit exclusion criteria for successful task executions. This limits the ability to verify whether the performance gains support the central claim of enabling reliable planning for unseen tasks with black-box skills.

    Authors: We agree that the current empirical presentation would benefit from greater detail and transparency. In the revised version we will augment all tables and figures with error bars (standard deviation across repeated trials), expand the description of baselines and ablations with explicit implementation details, and add a dedicated paragraph specifying the success criteria and any exclusion rules used for task executions. These additions will make the performance gains more verifiable and directly support the central claim. revision: yes
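
For reference, the error bars mentioned in the second response (standard deviation across repeated trials) can be computed along the following lines. This is a sketch with made-up outcomes, not data from the paper; `summarize` is a hypothetical helper, and each inner list stands for the success or failure of the tasks in one repeated run.

```python
from statistics import mean, stdev


def summarize(successes_per_run):
    """Return (mean success rate, sample standard deviation) across runs.

    successes_per_run: list of runs, each a list of 0/1 task outcomes.
    """
    rates = [sum(run) / len(run) for run in successes_per_run]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


# Made-up outcomes for three repeated runs of four tasks each.
runs = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

avg, spread = summarize(runs)
print(f"success rate: {avg:.2f} +/- {spread:.2f}")
```

Reporting the spread alongside explicit success criteria makes it possible to judge whether differences between methods exceed run-to-run variation.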

Circularity Check

0 steps flagged

No significant circularity; formal theory and method are independent

full rationale

The paper introduces a formal theory of generative predicate invention that yields symbolic operators for provably sound and complete planning, conditional on predicates satisfying stated properties such as accurate state classification and transition preservation. SkillWrapper then uses foundation models and active data collection from RGB observations to produce those predicates. No equations, self-referential definitions, or reductions appear that make the planning guarantees equivalent to fitted parameters or prior self-citations by construction. The derivation relies on external foundation models and robot data, keeping the central claims self-contained rather than circular. This matches the default expectation for papers without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger reflects high-level claims rather than explicit equations or sections; the formal theory is presumed to introduce assumptions about predicate properties that are not detailed here.

axioms (1)
  • domain assumption: Generated predicates satisfy the formal properties needed for sound and complete planning
    Invoked as the basis for the provable guarantees stated in the abstract.
invented entities (1)
  • Generative predicates invented by foundation models (no independent evidence)
    purpose: To produce human-interpretable symbolic abstractions of black-box skills from RGB observations
    New postulated mechanism that converts sensory data into plannable operators; no independent falsifiable handle is described in the abstract.

pith-pipeline@v0.9.0 · 5561 in / 1208 out tokens · 34918 ms · 2026-05-17T05:39:17.859665+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    BISON learns bilevel policies over symbolic world models to generalize long-horizon robotic planning beyond VLA and end-to-end baselines while remaining efficient even at 10,000-object scale.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In Proceedings of the 6th Conference on Robot Learning (CoRL), pp. 287–318, 14–18 Dec 2022

  3. [3]

    AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

    Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montserrat Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, et al. AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. In First Workshop on Vision-Language Models for Navigation and Manipulation (VLMNM) at ICRA 2024, 2024

  4. [4]

    A Review of Learning Planning Action Models

    Ankuj Arora, Humbert Fiorino, Damien Pellier, Marc Métivier, and Sylvie Pesty. A Review of Learning Planning Action Models. The Knowledge Engineering Review, 33:e20, 2018

  5. [5]

    Predicate Invention from Pixels via Pretrained Vision-Language Models

    Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Predicate Invention from Pixels via Pretrained Vision-Language Models. In AAAI 2025 Workshop on Language Models for Planning (LM4Plan), 2025

  6. [6]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Proceedings of the 7th Conference on Robot Learning, pp. 2165–2183, 06–09 Nov 2023

  8. [8]

    SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14455–14465, 2024

  9. [9]

    Vision-Language Models Provide Promptable Representations for Reinforcement Learning

    William Chen, Oier Mees, Aviral Kumar, and Sergey Levine. Vision-Language Models Provide Promptable Representations for Reinforcement Learning. Transactions on Machine Learning Research (TMLR), 2025. ISSN 2835-8856

  10. [10]

    EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

    Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14291–14302, 2024

  11. [11]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. In Proceedings of Robotics: Science and Systems (RSS) XX, 2024

  12. [12]

    An incremental constraint-based framework for task and motion planning

    Neil T Dantam, Zachary K Kingston, Swarat Chaudhuri, and Lydia E Kavraki. An incremental constraint-based framework for task and motion planning. The International Journal of Robotics Research, 37(10):1134–1151, 2018

  13. [13]

    Open-ended learning: a conceptual framework based on representational redescription

    S. Doncieux, D. Filliat, N. Díaz-Rodríguez, T. Hospedales, R. Duro, A. Coninx, D.M. Roijers, B. Girard, N. Perrin, and O. Sigaud. Open-ended learning: a conceptual framework based on representational redescription. Frontiers in Neurorobotics, 12:59, 2018

  14. [14]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An Embodied ...

  15. [15]

    Adaptive Procedural Task Generation for Hard-Exploration Problems

    Kuan Fang, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. Adaptive Procedural Task Generation for Hard-Exploration Problems. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  16. [16]

    Active Task Randomization: Learning Robust Skills via Unsupervised Generation of Diverse and Feasible Tasks

    Kuan Fang, Toki Migimatsu, Ajay Mandlekar, Li Fei-Fei, and Jeannette Bohg. Active Task Randomization: Learning Robust Skills via Unsupervised Generation of Diverse and Feasible Tasks. Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8, 2022

  17. [17]

    MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

    Kuan Fang, Fangchen Liu, Pieter Abbeel, and Sergey Levine. MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting. Proceedings of Robotics: Science and Systems (RSS) XX, 2024

  18. [18]

    Integrated Task and Motion Planning

    Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Integrated Task and Motion Planning. Annual Review of Control, Robotics, and Autonomous Systems, 4:265–293, 2021

  19. [19]

    Robotouille: An Asynchronous Planning Benchmark for LLM Agents

    Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, and Sanjiban Choudhury. Robotouille: An Asynchronous Planning Benchmark for LLM Agents. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

  20. [20]

    Multi-skill Mobile Manipulation for Object Rearrangement

    Jiayuan Gu, Devendra Singh Chaplot, Hao Su, and Jitendra Malik. Multi-skill Mobile Manipulation for Object Rearrangement. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2022

  21. [21]

    Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition

    Huy Ha, Pete Florence, and Shuran Song. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition. In Proceedings of the 7th Conference on Robot Learning (CoRL), pp. 3766–3777, 2023

  22. [22]

    InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning

    Muzhi Han, Yifeng Zhu, Song-Chun Zhu, Ying Nian Wu, and Yuke Zhu. InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning. In Proceedings of Robotics: Science and Systems (RSS) XX, 2024

  23. [23]

    3D-LLM: Injecting the 3D World into Large Language Models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D World into Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pp. 20482–20494, 2023

  24. [24]

    Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

    Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning. arXiv preprint arXiv:2311.17842, 2023

  25. [25]

    Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In Proceedings of the 39th International Conference on Machine Learning (ICML), pp. 9118–9147, 2022

  26. [26]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner Monologue: Embodied Reasoning through Planning with Language Models . In Proceedings of the 6th Conference on Ro...

  27. [27]

    RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation

    Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, and Yunzhu Li. RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation. In Proceedings of the 8th Conference on Robot Learning, pp. 3027–3052, 2025

  28. [28]

    Prioritized Level Replay

    Minqi Jiang, Edward Grefenstette, and Tim Rocktäschel. Prioritized Level Replay. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 4940–4950. PMLR, 2021

  29. [29]

    CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017

  30. [30]

    Safe Learning of Lifted Action Models

    Brendan Juba, Hai S. Le, and Roni Stern. Safe Learning of Lifted Action Models. In Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning (KR), pp. 379–389, Nov 2021

  31. [31]

    Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks

    Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  32. [32]

    K* and partial order reduction for top-quality planning

    Michael Katz and Junkyu Lee. K* and partial order reduction for top-quality planning. In Proceedings of the 16th Annual Symposium on Combinatorial Search (SoCS 2023). AAAI Press, 2023

  33. [33]

    On the Necessity of Abstraction

    George Konidaris. On the Necessity of Abstraction. Current Opinion in Behavioral Sciences, 29:1–7, 2019. ISSN 2352-1546

  34. [34]

    Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

    George Konidaris and Andrew Barto. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining. In Advances in Neural Information Processing Systems (NIPS), volume 22, 2009

  35. [35]

    From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning

    George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Pérez. From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning. Journal of Artificial Intelligence Research, 61:215–289, 2018

  36. [36]

    Planning for Learning Object Properties

    Leonardo Lamanna, Luciano Serafini, Mohamadreza Faridghasemnia, Alessandro Saffiotti, Alessandro Saetti, Alfonso Gerevini, and Paolo Traverso. Planning for Learning Object Properties. Proceedings of the AAAI Conference on Artificial Intelligence, 37(10):12005–12013, Jun. 2023

  37. [37]

    Embodied Active Learning of Relational State Abstractions for Bilevel Planning

    Amber Li and Tom Silver. Embodied Active Learning of Relational State Abstractions for Bilevel Planning. In Proceedings of The 2nd Conference on Lifelong Learning Agents (CoLLAs), pp. 358–375, 2023

  38. [38]

    LEAGUE++: Empowering Continual Robot Learning via Guided Skill Acquisition with Large Language Models

    Zhaoyi Li, Kelin Yu, Shuo Cheng, and Danfei Xu. LEAGUE++: Empowering Continual Robot Learning via Guided Skill Acquisition with Large Language Models. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

  39. [39]

    VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning

    Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B. Tenenbaum, Tom Silver, Joao F. Henriques, and Kevin Ellis. VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

  40. [40]

    OpenEQA: Embodied Question Answering in the Era of Foundation Models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Alexander Sax, and Aravind ...

  41. [41]

    The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

    Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision . In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019

  42. [42]

    McDermott, M

    D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins. PDDL -- The Planning Domain Definition Language . Technical report, CVC TR-98-003/DCS TR-1165, Yale Center for Computational Vision and Control, 1998

  43. [43]

    Grounding Predicates through Actions

    Toki Migimatsu and Jeannette Bohg. Grounding Predicates through Actions . In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), pp.\ 3498--3504, 2022

  44. [44]

    EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

    Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024

  45. [45]

    PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

    Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, and Brian Ichter. PIVOT: Iterative Visual Prompting Elicits Ac...

  46. [46]

    Introducing GPT-5

    OpenAI. Introducing GPT-5, 2025. URL https://openai.com/index/introducing-gpt-5/. Accessed:

  47. [47]

    CAPE: Corrective Actions from Precondition Errors using Large Language Models

    Shreyas Sundara Raman, Vanya Cohen, Ifrah Idrees, Eric Rosen, Ray Mooney, Stefanie Tellex, and David Paulius. CAPE: Corrective Actions from Precondition Errors using Large Language Models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 14070–14077, 2024

  48. [48]

    SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

    Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning. In Proceedings of the 7th Conference on Robot Learning (CoRL), volume 229, pp. 23–72, 06–09 Nov 2023

  49. [49]

    Explore until Confident: Efficient Exploration for Embodied Question Answering

    Allen Z. Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, and Dorsa Sadigh. Explore until Confident: Efficient Exploration for Embodied Question Answering. In Proceedings of Robotics: Science and Systems (RSS) XX, 2024

  50. [50]

    RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. RoboVQA: Multimodal Long-Horizon Reasoning for Robotics. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 645–652. IEEE, 2024

  51. [51]

    Anytime Integrated Task and Motion Policies for Stochastic Environments

    Naman Shah, Deepak Kala Vasudevan, Kislay Kumar, Pranav Kamojjhala, and Siddharth Srivastava. Anytime Integrated Task and Motion Policies for Stochastic Environments. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9285–9291. IEEE, 2020

  52. [52]

    From Real World to Logic and Back: Learning Generalizable Relational Concepts For Long Horizon Robot Planning

    Naman Shah, Jayesh Nagpal, Pulkit Verma, and Siddharth Srivastava. From Reals to Logic and Back: Inventing Symbolic Vocabularies, Actions and Models for Planning from Raw Data. arXiv preprint arXiv:2402.11871, 2024

  53. [53]

    C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3): 379–423, 1948

  54. [54]

    Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. In Proceedings of the 6th Conference on Robot Learning, volume 205, pp. 785–799, 14–18 Dec 2023

  55. [55]

    Predicate Invention for Bilevel Planning

    Tom Silver, Rohan Chitnis, Nishanth Kumar, Willie McClinton, Tomás Lozano-Pérez, Leslie Kaelbling, and Joshua B. Tenenbaum. Predicate Invention for Bilevel Planning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(10): 12120–12129, Jun. 2023

  56. [56]

    Distilling Internet-Scale Vision-Language Models into Embodied Agents

    Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, and Ishita Dasgupta. Distilling Internet-Scale Vision-Language Models into Embodied Agents. In Proceedings of the Fortieth International Conference on Machine Learning (ICML), pp. 32797–32818, 2023

  57. [57]

    ViperGPT: Visual Inference via Python Execution for Reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual Inference via Python Execution for Reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11888–11898, October 2023

  58. [58]

    Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

    Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1): 181–211, 1999

  59. [59]

    Habitat 2.0: Training Home Assistants to Rearrange their Habitat

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimír Vondruš, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Traini...

  60. [60]

    On the Planning Abilities of Large Language Models - A Critical Investigation

    Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the Planning Abilities of Large Language Models - A Critical Investigation. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pp. 75993–76005, 2023

  61. [61]

    Discovering User-Interpretable Capabilities of Black-Box Planning Agents

    Pulkit Verma, Shashank Rao Marpally, and Siddharth Srivastava. Discovering User-Interpretable Capabilities of Black-Box Planning Agents. In Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning (KR), volume 19, pp. 362–372, 2022

  62. [62]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research (TMLR), 2024a. ISSN 2835-8856

  63. [63]

    Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions

    Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O Stanley. Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions. arXiv preprint arXiv:1901.01753, 2019

  64. [64]

    RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

    Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 51484–51501, 21–27 Jul 2024b

  65. [65]

    RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

    Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024c

  66. [66]

    FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17868–17879, 2024

  67. [67]

    Neuro-Symbolic Learning of Lifted Action Models from Visual Traces

    Kai Xi, Stephen Gould, and Sylvie Thiébaux. Neuro-Symbolic Learning of Lifted Action Models from Visual Traces. Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), 34(1): 653–662, May 2024

  68. [68]

    Octopus: Embodied Vision-Language Programmer from Environmental Feedback

    Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Haoran Tan, Chencheng Jiang, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, et al. Octopus: Embodied Vision-Language Programmer from Environmental Feedback. In Proceedings of the 2024 European Conference on Computer Vision (ECCV), pp. 20–38, 2024

  69. [69]

    ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation

    Naoki Yokoyama, Alex Clegg, Joanne Truong, Eric Undersander, Tsung-Yen Yang, Sergio Arnaud, Sehoon Ha, Dhruv Batra, and Akshara Rai. ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation. IEEE Robotics and Automation Letters, 9(1): 779–786, 2024
