pith. sign in

arxiv: 2605.30326 · v1 · pith:6IDMUJBGnew · submitted 2026-05-28 · 💻 cs.RO · cs.AI

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Pith reviewed 2026-06-29 07:04 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robotic benchmarkcreative problem solvingvision-language-action modelstask mutationunexpected challengescognitive reasoningbi-manual manipulationrobustness evaluation
0
0 comments X

The pith

Pre-trained VLAs succeed on original robotic tasks but fail on mutated versions that require creative reasoning and adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboWits, a bi-manual robotic benchmark to evaluate cognitive reasoning, creative tool use, and robustness under unexpected conditions. It proposes an automated multi-agent pipeline that generates seed tasks and then mutates them across geometry, material, and assembly dimensions to create 208 varied scenarios from 30 seeds. Benchmark results show that pre-trained vision-language-action models achieve preliminary success on seed tasks after single-task fine-tuning yet perform poorly on the mutated tasks, exposing brittleness in strategy adaptation and handling of deceptive or constrained environments.

Core claim

The central claim is that while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments.

What carries the argument

The multi-agent cooperative framework for automated task generation, with agents handling seed task generation and verification, metric generation, scene generation, and task mutation to scale the creation of reasoning-centric unexpected scenarios.

If this is right

  • Current robotic benchmarks that focus on skill-level execution miss cognitive reasoning capabilities needed for real-world operation.
  • Pre-trained VLAs require additional mechanisms beyond single-task fine-tuning to handle variations in geometry, materials, and assembly constraints.
  • Oracle-state planners provide performance upper bounds but still reveal gaps that learned policies must close for robustness.
  • Bi-manual manipulation policies must incorporate strategy adaptation to succeed in deceptive or constrained environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models might close the gap by training explicitly on mutated or procedurally varied task distributions rather than isolated seeds.
  • The benchmark could serve as a diagnostic tool to measure progress toward open-ended robotic problem solving in unpredictable settings.
  • Real-world deployment of these systems may encounter similar failure modes unless training explicitly addresses unexpected changes in object properties or constraints.

Load-bearing premise

The mutations applied to the seed tasks create scenarios that require genuine creative reasoning and adaptation not captured by performance on the original tasks.

What would settle it

Demonstrating that a single fine-tuned VLA achieves comparable success rates on both the original seed tasks and the full set of 208 mutated tasks would falsify the brittleness claim.

Figures

Figures reproduced from arXiv: 2605.30326 by Chuang Gan, Chunru Lin, David Held, Fenghao Yu, Hongxin Zhang, Thomas L. Griffiths, Yejin Choi, Zhehuan Chen.

Figure 1
Figure 1. Figure 1: Creative problem solving under unexpected challenges. The figure contrasts the reasoning and execution processes of a standard robot (left) and an ideal robot (right) performing tasks under the same instruction but escalating difficulty. While both succeed in an Easy setting (top-left), the standard robot gets stuck repeating direct but unfeasible actions when the cube is trapped deep inside a cup (Hard, b… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our automated task generation pipeline, where foundation model-powered [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example object in the task schema. As shown in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of mutation on the retrieve cube seed task. Starting from the initial seed task, the mutator generates diverse task variations via pivot, trap, and add mutations, resulting in a family of tasks with distinct challenges and diverse difficulty. we obtain a family of task instances with diverse challenges and graded difficulty. To systematically increase task diversity and difficulty, we design t… view at source ↗
Figure 5
Figure 5. Figure 5: Gallery of the 30 bi-manual creative problem-solving manipulation tasks in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (Left) Distribution of task types and the three task [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of VLM planner∗∗ SRs under closed-loop and open-loop setting. Vision-language planners show promise through closed-loop reasoning. Given oracle object states and action feedback, VLM-based planners achieve reasonable performance on seed tasks and demonstrate superior ro￾bustness to mutations. Our ablation study comparing open￾loop and closed-loop variants underscores the necessity of reactivity:… view at source ↗
Figure 8
Figure 8. Figure 8: Ego-centric view of the 30 seed tasks in [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: depicts three examples of mutation trees [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
read the original abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces RoboWits, a bi-manual robotic benchmark for evaluating cognitive reasoning, creative tool use, and robustness under unexpected conditions. It proposes a multi-agent cooperative pipeline to generate 30 seed tasks and 208 mutated tasks across geometry, material, and assembly axes, then benchmarks pre-trained VLAs (showing preliminary success on seed tasks after single-task fine-tuning) alongside other policies and planners, concluding that the performance drop on mutated tasks reveals brittleness in reasoning, strategy adaptation, and robustness to deceptive or constrained environments.

Significance. If the mutations are shown to require qualitatively new reasoning rather than parametric generalization, the benchmark and generation pipeline would address a genuine gap in existing robotic evaluation suites that focus primarily on skill execution. The automated multi-agent construction method is a concrete strength for scalability and reproducibility.

major comments (1)
  1. [Abstract] Abstract: The central interpretive claim—that the observed performance gap on mutated tasks demonstrates brittleness 'in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments'—rests on the unverified assumption that the 208 mutations cannot be solved by modest extensions of seed-task policies or non-reasoning mechanisms. No human baselines, oracle success rates, or analysis distinguishing qualitative new subgoals from distribution shift are referenced, which is load-bearing for the headline conclusion.
minor comments (1)
  1. [Abstract] The abstract states that tasks are 'graded difficulty' but provides no detail on the grading procedure, validation, or inter-rater reliability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's central claim. We address the concern regarding verification of the mutations' requirements point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central interpretive claim—that the observed performance gap on mutated tasks demonstrates brittleness 'in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments'—rests on the unverified assumption that the 208 mutations cannot be solved by modest extensions of seed-task policies or non-reasoning mechanisms. No human baselines, oracle success rates, or analysis distinguishing qualitative new subgoals from distribution shift are referenced, which is load-bearing for the headline conclusion.

    Authors: The multi-agent task generation pipeline was explicitly designed to produce mutations along geometry, material, and assembly axes that introduce qualitatively new constraints and subgoals (e.g., deceptive object placements or novel tool combinations) not addressable by direct policy extension from seeds; the 30 seed tasks were verified as solvable by the same policies before mutation. The oracle-state planners provide success rates showing that mutated tasks remain solvable with perfect information, establishing that the gap is not due to inherent task infeasibility. We agree that human baselines and explicit subgoal-shift analysis would strengthen the interpretation and will add both a discussion of expected human performance (based on task design) and a breakdown of mutation types versus distribution shift in the revised manuscript. The abstract will be updated to qualify the conclusion as suggestive of brittleness rather than definitive proof. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivations or self-referential reductions

full rationale

The manuscript introduces RoboWits as an empirical benchmark constructed via a described multi-agent pipeline, with results consisting of performance measurements on seed and mutated tasks. No equations, fitted parameters presented as predictions, uniqueness theorems, or load-bearing self-citations appear in the provided text. The central claim is an observed performance gap on mutated tasks, which is a direct empirical observation rather than a reduction to prior inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation of brittleness relies on the assumption that the generated mutated tasks validly probe reasoning capabilities beyond the seed tasks.

axioms (1)
  • domain assumption Automated multi-agent systems can generate valid and diverse robotic tasks that test creative problem solving when mutated
    Invoked in the description of the pipeline for curating the 30 seed tasks and 208 mutated tasks.

pith-pipeline@v0.9.1-grok · 5764 in / 1138 out tokens · 42024 ms · 2026-06-29T07:04:15.583086+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

90 extracted references · 29 canonical work pages · 16 internal anchors

  1. [1]

    A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dha- balia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y . Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc, S. Levine, A. Li-Bell, ...

  2. [2]

    G. Authors. Genesis: A universal and generative physics engine for robotics and beyond.URL https://github. com/Genesis-Embodied-AI/Genesis, 2024

  3. [3]

    Black, N

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fu- sai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world generalization. In9th Annual Conference on Robot Learning, 2025

  4. [4]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π0: A vision-language- action flow model for general robot control, 2026

  5. [5]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 9

  6. [6]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  7. [7]

    T. Chen, K. Wang, Z. Yang, Y . Zhang, Z. Chen, B. Chen, W. Dong, Z. Liu, D. Chen, T. Yang, et al. Benchmarking generalizable bimanual manipulation: Robotwin dual-arm collaboration challenge at cvpr 2025 meis workshop.arXiv preprint arXiv:2506.23351, 2025

  8. [8]

    Chernyadev, N

    N. Chernyadev, N. Backshall, X. Ma, Y . Lu, Y . Seo, and S. James. Bigym: A demo-driven mobile bi-manual manipulation benchmark.arXiv preprint arXiv:2407.07788, 2024

  9. [9]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  10. [10]

    K. M. Collins, C. Wong, J. Feng, M. Wei, and J. B. Tenenbaum. Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks.arXiv preprint arXiv:2205.05718, 2022

  11. [11]

    Deitke, E

    M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, K. Ehsani, J. Salvador, W. Han, E. Kolve, A. Kembhavi, and R. Mottaghi. Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022

  12. [12]

    Z. Fu, T. Z. Zhao, and C. Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117, 2024

  13. [13]

    H. Geng, F. Wang, S. Wei, Y . Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y .-J. Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025

  14. [14]

    J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y . Tang, S. Tao, X. Wei, Y . Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills.arXiv preprint arXiv:2302.04659, 2023

  15. [15]

    J. P. Guilford. Creativity: Yesterday, today and tomorrow.The Journal of Creative Behavior, 1(1):3–14, 1967

  16. [16]

    J. Hao, N. Liang, Z. Luo, X. Xu, W. Zhong, R. Yi, Y . Jin, Z. Lyu, F. Zheng, L. Ma, et al. Mesatask: Towards task-driven tabletop scene generation via 3d spatial reasoning.arXiv preprint arXiv:2509.22281, 2025

  17. [17]

    Z. Hu, A. Iscen, A. Jain, T. Kipf, Y . Yue, D. A. Ross, C. Schmid, and A. Fathi. Scenecraft: An llm agent for synthesizing 3d scenes as blender code. InForty-first International Conference on Machine Learning, 2024

  18. [18]

    Huang, F

    Z. Huang, F. Chen, Y . Pu, C. Lin, H. Su, and C. Gan. Diffvl: Scaling up soft body manipulation using vision-language driven differentiable physics.Advances in Neural Information Processing Systems, 36:29875–29900, 2023

  19. [19]

    James, Z

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  20. [20]

    J. C. Kaufman and R. A. Beghetto. Beyond big and little: The four c model of creativity.Review of general psychology, 13(1):1–12, 2009

  21. [21]

    T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations.arXiv preprint arXiv:2402.10885, 2024

  22. [22]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  23. [23]

    J. Lee, J. Duan, H. Fang, Y . Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y . R. Wang, S. Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025. 10

  24. [24]

    Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  25. [25]

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kir- mani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

  26. [26]

    C. Lin, J. Fan, Y . Wang, Z. Yang, Z. Chen, L. Fang, T.-H. Wang, Z. Xian, and C. Gan. Ubsoft: A simulation platform for robotic skill learning in unbounded soft environments.arXiv preprint arXiv:2411.12711, 2024

  27. [27]

    C. Lin, H. Yuan, Y . Wang, X. Qiu, T.-H. Wang, M. Guo, B. Wang, Y . Narang, D. Fox, and C. Gan. Robotsmith: Generative robotic tool design for acquisition of complex manipulation skills.arXiv preprint arXiv:2506.14763, 2025

  28. [28]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  29. [29]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  30. [30]

    J. Ma, W. Liang, H.-J. Wang, Y . Zhu, L. Fan, O. Bastani, and D. Jayaraman. Dreureka: Language model guided sim-to-real transfer. RSS, 2024

  31. [31]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  32. [32]

    O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  33. [33]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

  34. [34]

    Pumacay, I

    W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox. The colosseum: A bench- mark for evaluating generalization for robotic manipulation.arXiv preprint arXiv:2402.08191, 2024

  35. [35]

    J. Song, H. Ma, O. Bagoren, A. Sethuraman, Y . Zhang, and K. A. Skinner. Oceansim: A gpu- accelerated underwater robot perception simulation framework. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1526–1533. IEEE, 2025

  36. [36]

    G. A. Team. Gen-0: Embodied foundation models that scale with physical interaction.General- ist AI Blog, 2025. https://generalistai.com/blog/preview-uqlxvb-bb.html

  37. [37]

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  38. [38]

    Y . Tian, A. Ravichander, L. Qin, R. Le Bras, R. Marjieh, N. Peng, Y . Choi, T. L. Griffiths, and F. Brahman. Macgyver: Are large language models creative problem solvers? InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5303–5324, 2024

  39. [39]

    Todorov, T

    E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  40. [40]

    C. Wang, H. Fang, H.-S. Fang, and C. Lu. Rise: 3d perception makes real-world robot imitation simple and effective. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2870–2877. IEEE, 2024. 11

  41. [41]

    T.-H. Wang, P. Ma, A. E. Spielberg, Z. Xian, H. Zhang, J. B. Tenenbaum, D. Rus, and C. Gan. Softzoo: A soft robot co-design benchmark for locomotion in diverse environments.arXiv preprint arXiv:2303.09555, 2023

  42. [42]

    Y . Wang, J. Duan, D. Fox, and S. Srinivasa. Newton: Are large language models capable of physical reasoning? InFindings of the association for computational linguistics: EMNLP 2023, pages 9743–9758, 2023

  43. [43]

    Y . Wang, X. Qiu, J. Liu, Z. Chen, J. Cai, Y . Wang, T.-H. Wang, Z. Xian, and C. Gan. Architect: Generating vivid and interactive 3d scenes with hierarchical 2d inpainting.Advances in Neural Information Processing Systems, 37:67575–67603, 2024

  44. [44]

    Y . Wang, Z. Xian, F. Chen, T.-H. Wang, Y . Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023

  45. [45]

    Y . R. Wang, C. Ung, G. Tannert, J. Duan, J. Li, A. Le, R. Oswal, M. Grotz, W. Pumacay, Y . Deng, et al. Roboeval: Where robotic manipulation meets structured and scalable evaluation. arXiv preprint arXiv:2507.00435, 2025

  46. [46]

    J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025

  47. [47]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

  48. [48]

    M. Xu, P. Huang, W. Yu, S. Liu, X. Zhang, Y . Niu, T. Zhang, F. Xia, J. Tan, and D. Zhao. Creative robot tool use with large language models.arXiv preprint arXiv:2310.13065, 2023

  49. [49]

    Yang, F.-Y

    Y . Yang, F.-Y . Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

  50. [50]

    S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  51. [51]

    Yenamandra, A

    S. Yenamandra, A. Ramachandran, K. Yadav, A. Wang, M. Khanna, T. Gervet, T.-Y . Yang, V . Jain, A. W. Clegg, J. Turner, et al. Homerobot: Open-vocabulary mobile manipulation.arXiv preprint arXiv:2306.11565, 2023

  52. [52]

    T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. InConference on robot learning, pages 1094–1100. PMLR, 2020

  53. [53]

    Zhang, Z

    S. Zhang, Z. Xu, P. Liu, X. Yu, Y . Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y .-G. Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

  54. [54]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  55. [55]

    Y . Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, and Y . Zhu. robo- suite: A modular simulation framework and benchmark for robot learning.arXiv preprint arXiv:2009.12293, 2020

  56. [56]

    x _min": 0,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. 12 Figure 8: Ego-centric view of the 30 seed tasks inRoboWits. A Task Details The details of 30 seed tasks are as fol...

  57. [57]

    39- No (Not Complete): There are missing obejcts mentioned but not in the ’ object_list’

    Completeness: 38- Yes (Complete): All the objects mentioned in the ’initial _scene_setup’, ’task_instruction’, ’task _success_criteria’ and ’potential _solution’ are in the ’object _list’, except the table and the robot. 39- No (Not Complete): There are missing obejcts mentioned but not in the ’ object_list’

  58. [58]

    a narrow gap), or geometry with unintended flaws (e.g., unintended holes, scratches, wear, or micro-grooves)

    Simulatability: 41- Impossible: Tasks involving aerodynamics, magnetism, thermodynamics, sticky materials, abstract objects (e.g. a narrow gap), or geometry with unintended flaws (e.g., unintended holes, scratches, wear, or micro-grooves). 42- Hard: Tasks involving contact-rich rigid-body interactions, soft bodies , liquids (including buoyancy), or compon...

  59. [59]

    High- complexity multi-object interactions that exceed dual-arm coordination limits are also considered not feasible

    Solution Feasibility: 45- Not Feasible: Solutions requiring breaking, tearing, or creating permanent defects (e.g., drilling holes) are not feasible. High- complexity multi-object interactions that exceed dual-arm coordination limits are also considered not feasible. 46- Kind of Feasible: Solutions requiring highly precise dynamic control, such as throwin...

  60. [60]

    52- Yes (Efficiency): If there is no other way except ’potential _solution’ to solve the task, it’s efficient

    Solution Efficiency (Bypass Check): 51- No (Not Efficient): The task is not efficient only if there is a significantly easier or more efficient solution. 52- Yes (Efficiency): If there is no other way except ’potential _solution’ to solve the task, it’s efficient. If there are other solutions, but ’potential_solution’ is easier for operation (e.g. requiri...

  61. [61]

    Each step involves a single robot arm to operate a single object, and the two robot arms 22 don’t need to operate together

    Difficulty (a score between 1 and 5): 56- Score 1: Tasks can be solved within two steps. Each step involves a single robot arm to operate a single object, and the two robot arms 22 don’t need to operate together. The objects used in the ’ potential_solution’ are common in daily life and easy to operate by a robot arm with a parallel gripper. 57- Score 2: ...

  62. [62]

    8- Modify or remove the MINIMAL number of existing objects (usually the primary tool) to block the current solution

    Pivot: Block the current ’potential _solution’ by modifying or removing objects in ’object _list’, then add new objects to support a new solution. 8- Modify or remove the MINIMAL number of existing objects (usually the primary tool) to block the current solution. 9- For any object mentioned in ’task _success_criteria’, it is not removable, and its ’object...

  63. [63]

    14- You must NOT modify or remove any original objects in the ’object _list ’

    Trap (ADDITIVE ONLY): 13- Add exactly one object that looks like a potential solution but fails due to wrong attributes (e.g., a ‘‘soft’’ bridge that collapses, a ‘‘ light’’ hammer, a screwdriver ‘‘fixed on the table’’). 14- You must NOT modify or remove any original objects in the ’object _list ’

  64. [64]

    17- You must NOT modify or remove any original objects in the ’object _list ’

    Related (ADDITIVE ONLY): 16- Add N objects that belong in the environment (e.g., a fork in a kitchen ) but are unnecessary for the solution. 17- You must NOT modify or remove any original objects in the ’object _list ’

  65. [65]

    x _min": 0.21,

    Unrelated (ADDITIVE ONLY): 19- Add N objects as visual/spatial noise (clutter). 20- You must NOT modify or remove any original objects in the ’object _list ’. 21 22IMPORTANT: 23- Always return the full, valid JSON object. 24- Ensure ’object _list’ and ’initial _scene_setup’ are perfectly synchronized. 23 25- For ADDITIVE types, the ’object _list’ must con...

  66. [66]

    39- False: the task is incomplete and fails

    Output Convention: 38- True: the task is completed and succeeds. 39- False: the task is incomplete and fails

  67. [67]

    Use scene.entities to access these objects within the code block

    State Extraction: Use the provided code to identify entity names and their starting positions. Use scene.entities to access these objects within the code block

  68. [68]

    Stick to ‘task _success_criteria‘: Implement the natural language success criteria in Python code

  69. [69]

    Spatial precision: When performing spatial checks (e.g., checking if objects are on the table, checking positions, overlaps), prefer using ‘convex_hull_2d‘ over ‘bounds‘ (AABB) for rigid objects, as ‘ convex_hull_2d‘ provides more precise geometry representation

  70. [70]

    _is_completed

    Default criteria: 44- Only check objects that are involved in the task _success_criteria. For those objects, ensure they do not fall off from the table (excluding ’table’ and ’robot’ from checks). 45 46OUTPUT FORMAT: 47‘‘‘json 48{ 49" _is_completed": "def _is_completed(objs_info) -> bool:\n \"\"\" Checks if the task is completed.\"\"\"\n ..." 50} 51‘‘‘ E....

  71. [71]

    29- Otherwise, always seek for suitable assets in the model library first

    Create Entities 28- If object in the object list has assigned ‘use _primitive‘ or ‘asset _id‘, respect it, use ‘retrieve _asset_info_by_id‘ tool to get the information of the asset. 29- Otherwise, always seek for suitable assets in the model library first. Only use primitives when there’s no good choice. 30- Scale the asset properly based on its bounds an...

  72. [72]

    Place Entities 34- With a retrieved asset, the appropriate ‘position‘, ‘euler‘ should be computed based on the asset’s bbox after scaling and the desired world -space alignment. 35- For liquids/granular objects (water, sands) that should be contained in a container object, carefully interpret the volume they should occupy (which may be smaller than the co...

  73. [73]

    recon" vis _mode for particle materials MPM and SPH. Default vis_mode of

    Verify Entities: 38- Use ‘render _current_frame‘ tool to get visual feedback of the current tabletop scene from four different views, and exmaine it carefully. 39- Use the ‘get _entity_info‘ tool to retrieve the current details of the entity. 40- Make sure no entity is in collision with other entities. Otherwise, adjust the ‘euler‘, ‘scale‘, and ‘position...

  74. [74]

    left" or

    MOVETO _GRASP(arm, object _name, grasping _direction, offs, dlt _euler, gripper) - Move end-effector to object for grasping 10- arm: "left" or "right" 11- object _name: name of object to grasp 12- grasping _direction: gripper orientation string in format "{{ gripper_faces}}_{{fingers_align}}_{{fingers_point}}" 13Common options: 14- "down _left-right_front...

  75. [75]

    left" or

    MOVETO _OBJECT(arm, object _name, dlt _pos, dlt _euler, steps, gripper) - Move arm to object position with offset 31- arm: "left" or "right" 32- object _name: name of object to move to 33- dlt _pos: [dx, dy, dz] position offset from object center (default [0, 0, 0]) 34- dlt _euler: [rx, ry, rz] delta euler angles in degrees (default [0, 0, 0]) 35- steps: ...

  76. [76]

    left" or

    OPEN _GRIPPER(arm) - Open the gripper 39- arm: "left" or "right" 40

  77. [77]

    left" or

    CLOSE _GRIPPER(arm) - Close the gripper 42- arm: "left" or "right" 43

  78. [78]

    " = both open,

    HOLD(steps, gripper) - Hold current position 45- steps: number of steps to hold (default 10) 46- gripper: which grippers to keep closed ("" = both open, "l" = left closed, "r" = right closed, "lr" = both closed) 47

  79. [79]

    left" or

    RETURN _HOME(arm) - Return arm to initial/home position 49- arm: "left" or "right" 50

  80. [80]

    x _min": 0.21,

    DONE() - Call this when the task appears to be completed 52- Use this when you believe the task goal has been achieved 53 54COORDINATE SYSTEM: 55- X: forward (positive) / backward (negative) 56- Y: left (positive) / right (negative) 57- Z: up (positive) / down (negative) 58- Robot workspace is roughly: {{"x _min": 0.21, "x _max": 1.00, "y _min": -0.69, "y...

Showing first 80 references.