arxiv: 2507.21545 · v3 · submitted 2025-07-29 · 💻 cs.RO

UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning

Pith reviewed 2026-05-19 02:55 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot task planning · PDDL domain learning · pretraining from demonstrations · zero-shot generalization · symbolic planning · manipulation videos · compositional generalization

The pith

A single PDDL domain pretrained from robot videos supports zero-shot planning on unseen manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that atomic planning domains can be extracted from a large collection of real robot videos and then fused on demand into meta-domains that let symbolic planners solve new task combinations without hand-crafted rules or task-specific retraining. A sympathetic reader would care because language-model planners often lose track of long-horizon constraints while purely symbolic planners require expert domain engineering that does not scale. If the method works, a robot could absorb broad action knowledge from demonstrations once and then compose reliable plans for novel goals that share substructures with the original data. The central object is the unified domain that stores operators, predicates, and causal edges so they can be retrieved and combined without introducing planning errors.

Core claim

Processing 12393 manipulation videos yields a unified PDDL domain containing 3137 operators, 2875 predicates, and 16481 causal edges. For any target task class the system retrieves the relevant atomic pieces and fuses them into a meta-domain that supplies the symbolic structure needed for compositional planning. This pipeline produces plans for complex unseen tasks in a zero-shot setting, with measured gains in both task success and plan optimality over direct LLM and LLM-PDDL baselines.
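
The retrieve-then-fuse step can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the `Operator` and `AtomicDomain` classes, the keyword-overlap retrieval rule, and the example operators are hypothetical, and fusion is modelled as a plain union of operators.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator (hypothetical schema)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset = frozenset()


@dataclass
class AtomicDomain:
    """One small domain, assumed here to come from a single video."""
    keywords: set
    operators: dict = field(default_factory=dict)  # name -> Operator

    def predicates(self):
        preds = set()
        for op in self.operators.values():
            preds |= op.preconditions | op.add_effects | op.del_effects
        return preds


def retrieve(atomics, task_keywords, k=2):
    """Rank atomic domains by keyword overlap with the task (assumed criterion)."""
    ranked = sorted(atomics, key=lambda d: len(d.keywords & task_keywords), reverse=True)
    return [d for d in ranked[:k] if d.keywords & task_keywords]


def fuse(atomics):
    """Union operators into one meta-domain; the first definition of a name wins."""
    meta = AtomicDomain(keywords=set())
    for d in atomics:
        meta.keywords |= d.keywords
        for name, op in d.operators.items():
            meta.operators.setdefault(name, op)
    return meta


# Hypothetical atomic domains, for illustration only.
drawer = AtomicDomain(
    keywords={"drawer", "open"},
    operators={"open_drawer": Operator(
        "open_drawer", frozenset({"closed"}), frozenset({"opened"}), frozenset({"closed"}))},
)
wipe = AtomicDomain(
    keywords={"wipe", "table", "towel"},
    operators={"wipe_table": Operator(
        "wipe_table", frozenset({"holding_towel"}), frozenset({"table_clean"}))},
)
pour = AtomicDomain(
    keywords={"pour", "cup"},
    operators={"pour": Operator("pour", frozenset({"holding_cup"}), frozenset({"poured"}))},
)

selected = retrieve([drawer, wipe, pour], {"wipe", "table", "drawer"}, k=2)
meta = fuse(selected)
print(sorted(meta.operators))  # ['open_drawer', 'wipe_table']
```

The "first definition wins" rule in `fuse` is the simplest possible choice and silently hides disagreements between atomic domains, which is exactly the soundness question raised later in the referee report.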

What carries the argument

The unified PDDL domain built by extracting atomic domains from videos and then retrieving and fusing them into task-specific meta-domains.

If this is right

  • Robots gain the ability to plan long sequences by composing pre-learned causal edges rather than rediscovering them for each new goal.
  • Plan optimality improves because the meta-domain inherits real-world ordering and precondition constraints captured from demonstrations.
  • Zero-shot generalization extends to any task class whose sub-actions appear in the video corpus, without requiring new domain authoring.
  • Symbolic planning becomes more tightly grounded in perception because predicates and operators are derived directly from observed manipulation sequences.
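
The first bullet's idea of composing pre-learned causal edges can be made concrete with a small graph sketch. It assumes that an edge runs from a predicate to every operator that requires it, and from an operator to every predicate it adds. This edge definition, the tuple layout, and the example operators are assumptions made for illustration, not the paper's specification.

```python
from collections import defaultdict

# (name, preconditions, add effects); hypothetical operators
OPERATORS = [
    ("open_drawer", {"drawer_closed"}, {"drawer_open"}),
    ("pick_towel", {"drawer_open", "hand_empty"}, {"holding_towel"}),
    ("wipe_table", {"holding_towel"}, {"table_clean"}),
]


def causal_edges(operators):
    """Return directed edges predicate->operator and operator->predicate."""
    edges = set()
    for name, pre, add in operators:
        edges |= {(p, name) for p in pre}
        edges |= {(name, p) for p in add}
    return edges


def enables(operators):
    """Operator A enables B when some add effect of A is a precondition of B."""
    successors = defaultdict(set)
    for a_name, _, a_add in operators:
        for b_name, b_pre, _ in operators:
            if a_name != b_name and a_add & b_pre:
                successors[a_name].add(b_name)
    return dict(successors)


edges = causal_edges(OPERATORS)
print(len(edges))          # 7
print(enables(OPERATORS))  # open_drawer enables pick_towel, which enables wipe_table
```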

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the video corpus grows, the same extraction and fusion steps could cover an expanding range of manipulation skills with no additional expert input.
  • The retrieval-plus-fusion pattern might transfer to other formalisms such as temporal logic or STRIPS variants once similar atomic extraction is defined.
  • Running the fusion step incrementally as new videos arrive could keep the unified domain current without full re-pretraining.

Load-bearing premise

Atomic domains pulled from the videos accurately encode the implicit constraints and causal relations required for safe, error-free fusion into meta-domains that generalize to unseen tasks.

What would settle it

Deploy the fused meta-domains on a held-out set of tasks and check whether any generated plan produces physical violations or unsafe actions that contradict the constraints observed in the original video set.
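
A symbolic version of this check is cheap to run before any physical trial. The sketch below replays a plan against STRIPS-style operators and reports the first step whose preconditions do not hold. The operator table, predicate names, and `validate_plan` helper are hypothetical, and detecting physical violations on a real robot would require far more than this.

```python
# name -> (preconditions, add effects, delete effects); hypothetical operators
OPS = {
    "open_drawer": ({"drawer_closed", "hand_empty"}, {"drawer_open"}, {"drawer_closed"}),
    "pick_towel": ({"drawer_open", "hand_empty"}, {"holding_towel"}, {"hand_empty"}),
    "wipe_table": ({"holding_towel"}, {"table_clean"}, set()),
}


def validate_plan(initial_state, plan, goal, ops=OPS):
    """Simulate the plan step by step; return (ok, message)."""
    state = set(initial_state)
    for step, name in enumerate(plan):
        pre, add, delete = ops[name]
        missing = pre - state
        if missing:
            return False, f"step {step} ({name}) missing {sorted(missing)}"
        state = (state - delete) | add
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: {sorted(unmet)}"
    return True, "plan is valid"


init = {"drawer_closed", "hand_empty"}
good = ["open_drawer", "pick_towel", "wipe_table"]
bad = ["pick_towel", "wipe_table"]  # skips opening the drawer

print(validate_plan(init, good, {"table_clean"}))
print(validate_plan(init, bad, {"table_clean"}))
```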

Figures

Figures reproduced from arXiv: 2507.21545 by Cewu Lu, Haoming Ye, Panpan Cai, Yunxiao Xiao.

Figure 1: Visualization of our pre-trained unified domain, with 3,137 operator nodes (green) and ...
Figure 2: Overview of UniDomain. See detailed descriptions in Section 3.
Figure 3: Comparison results of UniDomain and state-of-the-art methods on unseen evaluation tasks: ...
Figure 4: Results for ablation studies on domain generation: (a) ablation on the atomic domain ...
Figure 5: Results for ablation study of the UniDo...
Figure 6: Task used in the overview figure 2: "Move the corn from the pot into the orange bowl, wipe the table with the towel in the drawer and put it back to the closed drawer."
Figure 7: Execution sequence of the task shown in Figure 2, planned by UniDomain.
Figure 8: Arrange all blocks into two separate stacks on the table. The first stack should have blocks 1, 3, 5, and 7 in order from top to bottom. The second stack should have blocks 2, 4, 6, and 8 in order from top to bottom
Figure 10: There is a block in the green drawer. Please put it on the table, push it and put it in the yellow drawer
Figure 12: Put the jujube in the green bowl. And put the white plate on the rack
Figure 15: There are a spoon, a tissue, an orange block in the green drawer. Stir the bowl and put the spoon in the cup, put the orange block into the yellow drawer, wipe the bowl and scrunch the tissue on the table.
Original abstract

Robotic task planning in real-world environments requires reasoning over implicit constraints from language and vision. While LLMs and VLMs offer strong priors, they struggle with long-horizon structure and symbolic grounding. Existing methods that combine LLMs with symbolic planning often rely on handcrafted or narrow domains, limiting generalization. We propose UniDomain, a framework that pre-trains a PDDL domain from robot manipulation demonstrations and applies it for online robotic task planning. It extracts atomic domains from 12,393 manipulation videos to form a unified domain with 3137 operators, 2875 predicates, and 16481 causal edges. Given a target class of tasks, it retrieves relevant atomics from the unified domain and systematically fuses them into high-quality meta-domains to support compositional generalization in planning. Experiments on diverse real-world tasks show that UniDomain solves complex, unseen tasks in a zero-shot manner, achieving up to 58% higher task success and 160% improvement in plan optimality over state-of-the-art LLM and LLM-PDDL baselines.
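
Headline figures such as "58% higher" and "160% improvement" can be read either as absolute differences (percentage points) or as relative changes, and the abstract alone does not say which. The arithmetic below uses made-up baseline and method scores purely to show how far the two readings can diverge. None of these numbers come from the paper.

```python
def absolute_gain(baseline, method):
    """Difference in percentage points between two rates given in [0, 1]."""
    return (method - baseline) * 100


def relative_gain(baseline, method):
    """Relative improvement over the baseline, in percent."""
    return (method - baseline) / baseline * 100


# Hypothetical scores, NOT taken from the paper.
baseline, method = 0.30, 0.60
print(round(absolute_gain(baseline, method), 1))  # 30.0 percentage points
print(round(relative_gain(baseline, method), 1))  # 100.0 percent relative
```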

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents UniDomain, a framework that extracts atomic PDDL domains from 12,393 real-world robot manipulation videos to construct a unified domain containing 3137 operators, 2875 predicates, and 16481 causal edges. For a given target task class, relevant atomic domains are retrieved and fused into meta-domains that support compositional, zero-shot planning. Experiments on diverse real-world tasks report up to 58% higher task success and 160% better plan optimality than LLM and LLM-PDDL baselines.

Significance. If the extracted atomic domains faithfully encode implicit constraints and causal structure, the approach could meaningfully advance scalable, generalizable symbolic planning for robotics by moving beyond handcrafted or narrow domains. The scale of the demonstration corpus and the reported gains over strong baselines indicate practical potential for reducing reliance on manual domain engineering, provided the fusion step preserves soundness.

major comments (2)
  1. [Abstract / Domain Extraction] Abstract and domain-construction description: the central claim that the unified domain enables reliable zero-shot generalization rests on the fidelity of predicate/operator extraction and causal-edge inference from video demonstrations. No quantitative validation (e.g., precision/recall against ground-truth annotations or execution-trace consistency checks) is reported for the 2875 predicates or 16481 causal edges; systematic mis-extraction would directly undermine the 58% success and 160% optimality gains.
  2. [Meta-Domain Fusion] Fusion and retrieval step: the manuscript states that atomic domains are systematically fused into meta-domains, yet provides no analysis of how inconsistencies (e.g., conflicting preconditions or effect ordering) are detected or resolved. Without such checks, it is unclear whether the reported performance improvements can be attributed to the learned symbolic structure rather than incidental factors in the planning pipeline.
minor comments (2)
  1. [Retrieval Procedure] Clarify the exact criteria used to decide which atomic domains are retrieved for a given task class; the current description leaves the retrieval mechanism somewhat underspecified.
  2. [Unified Domain Statistics] Add a table or figure summarizing the distribution of operator arity, predicate types, and causal-edge density across the unified domain to help readers assess its complexity.
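
The validation requested in the first major comment could take the form of a set comparison between extracted and hand-annotated predicates for a sample of videos. The sketch below is a generic precision/recall computation. The predicate names and the annotated sample are invented for illustration, and exact string matching is an assumed (and fairly strict) matching rule.

```python
def precision_recall(extracted, annotated):
    """Compare two collections of predicate names by exact match."""
    extracted, annotated = set(extracted), set(annotated)
    true_pos = len(extracted & annotated)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical sample for one video.
extracted = {"holding(towel)", "open(drawer)", "on(towel, table)", "clean(cup)"}
annotated = {"holding(towel)", "open(drawer)", "on(towel, table)",
             "clean(table)", "closed(drawer)"}

p, r = precision_recall(extracted, annotated)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```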

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our methods.

Point-by-point responses
  1. Referee: [Abstract / Domain Extraction] Abstract and domain-construction description: the central claim that the unified domain enables reliable zero-shot generalization rests on the fidelity of predicate/operator extraction and causal-edge inference from video demonstrations. No quantitative validation (e.g., precision/recall against ground-truth annotations or execution-trace consistency checks) is reported for the 2875 predicates or 16481 causal edges; systematic mis-extraction would directly undermine the 58% success and 160% optimality gains.

    Authors: We agree that explicit quantitative validation of the predicate and operator extraction, as well as causal-edge inference, would provide stronger support for the claims. The manuscript currently demonstrates the quality of the extracted domain indirectly through large-scale real-world experiments that yield substantial gains over baselines. We acknowledge that direct precision/recall metrics or systematic execution-trace checks against ground-truth annotations are not reported. In the revised version, we will expand the domain extraction section to include a human evaluation on a representative sample of the data, reporting precision and recall figures, along with additional consistency checks where execution traces can be obtained. revision: yes

  2. Referee: [Meta-Domain Fusion] Fusion and retrieval step: the manuscript states that atomic domains are systematically fused into meta-domains, yet provides no analysis of how inconsistencies (e.g., conflicting preconditions or effect ordering) are detected or resolved. Without such checks, it is unclear whether the reported performance improvements can be attributed to the learned symbolic structure rather than incidental factors in the planning pipeline.

    Authors: The referee is correct that the current description of the fusion step is high-level and does not detail inconsistency detection or resolution. The manuscript emphasizes systematic retrieval and fusion to produce meta-domains that support compositional zero-shot planning, but does not analyze conflict handling. We will revise the methods section to provide a precise description of the fusion algorithm, including how conflicting preconditions or effect orderings are identified and resolved (e.g., via constraint prioritization and consistency verification), and report empirical observations on conflict frequency and resolution outcomes from our experiments. revision: yes
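
The kind of consistency check the rebuttal promises might look like the following sketch. It flags operators that share a name across atomic domains but disagree on preconditions or effects, and operators whose effects both add and delete the same predicate. The domain layout and the two conflict rules are assumptions for illustration, since the manuscript does not describe its actual procedure.

```python
def find_conflicts(domains):
    """domains: {domain_name: {op_name: (pre, add, delete)}} -> list of messages."""
    conflicts = []
    seen = {}  # op_name -> (domain_name, definition)
    for dom_name, ops in domains.items():
        for op_name, (pre, add, delete) in ops.items():
            # Rule 1: an operator must not add and delete the same predicate.
            clash = add & delete
            if clash:
                conflicts.append(f"{dom_name}.{op_name}: adds and deletes {sorted(clash)}")
            # Rule 2: same-named operators must agree across domains.
            if op_name in seen:
                other_dom, other_def = seen[op_name]
                if other_def != (pre, add, delete):
                    conflicts.append(f"{op_name}: {other_dom} and {dom_name} disagree")
            else:
                seen[op_name] = (dom_name, (pre, add, delete))
    return conflicts


# Hypothetical atomic domains.
domains = {
    "video_a": {"open_drawer": (frozenset({"closed"}), frozenset({"open"}),
                                frozenset({"closed"}))},
    "video_b": {"open_drawer": (frozenset({"closed", "hand_empty"}), frozenset({"open"}),
                                frozenset({"closed"}))},
    "video_c": {"toggle": (frozenset(), frozenset({"on"}), frozenset({"on"}))},
}

for line in find_conflicts(domains):
    print(line)
```

Reporting how often each rule fires over the full corpus would give the conflict-frequency figures the authors commit to adding.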

Circularity Check

0 steps flagged

Data-driven extraction and held-out evaluation keep circularity low

Full rationale

The derivation proceeds by extracting atomic PDDL domains from 12,393 videos, unifying them into a single domain, retrieving relevant atomics for a target task class, and fusing them into meta-domains. Reported success and optimality metrics are measured on held-out real-world tasks rather than being direct algebraic consequences of any fitted parameters or self-referential definitions. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the described pipeline. The central generalization claim therefore rests on empirical performance outside the training corpus.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that video demonstrations contain sufficient information to extract sound PDDL operators and predicates and that the fusion procedure preserves correctness; no explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • Domain assumption: Video demonstrations of manipulation tasks contain the causal structure and constraints needed to derive valid PDDL operators and predicates.
    Invoked when the paper states that atomic domains are extracted from 12,393 videos to form the unified domain.
  • Domain assumption: Relevant atomic domains can be retrieved and fused without introducing planning inconsistencies or unsafe actions for unseen tasks.
    Invoked in the description of retrieving atomics and systematically fusing them into meta-domains for target task classes.

pith-pipeline@v0.9.0 · 5722 in / 1437 out tokens · 24008 ms · 2026-05-19T02:55:59.465576+00:00 · methodology


