Playful Agentic Robot Learning

Angjoo Kanazawa; Dantong Niu; David M. Chan; Haiwen Feng; Hanjun Yoo; Ion Stoica; Jiahui Lei; Jiaxin Ge; Junyi Zhang; Justin Yu

arxiv: 2606.19419 · v1 · pith:PTBKU5GWnew · submitted 2026-06-17 · 💻 cs.RO · cs.AI

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

Pith reviewed 2026-06-26 20:52 UTC · model grok-4.3

keywords: playful learning · code-as-policy · robot skill acquisition · agentic systems · self-directed exploration · skill library · embodied agents · LIBERO benchmark

The pith

Robot agents build reusable code skills through self-directed play before facing specific tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current Code-as-Policy robot systems acquire reusable skills only after explicit task instructions, limiting their adaptability. It proposes that an embodied agent can instead use a pre-task stage of playful exploration to propose its own learnable tasks, execute and verify code policies with feedback, diagnose failures, and distill successes into a persistent skill library. When downstream tasks arrive, the agent retrieves relevant skills from this library to improve performance. Experiments demonstrate concrete gains on held-out benchmarks and show that the same library can be inserted into other agents at inference time.

Core claim

Robotics Agent Teams perform self-directed play by proposing novel yet learnable exploratory tasks, planning and executing robot-code policies, verifying intermediate progress with step-level feedback, diagnosing failures, retrying, and distilling successful executions into a frozen code skill library that is later retrieved to solve new tasks.
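
As a rough illustration of the loop in the core claim (propose, execute, verify, retry with feedback, distill, then freeze), the following Python sketch may help. It is not the paper's implementation: `Attempt`, `SkillLibrary`, `play`, and the `propose`/`solve` callables are hypothetical names introduced here, and the retry budget of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Attempt:
    code: str       # candidate robot-code policy (hypothetical representation)
    success: bool   # verifier verdict for this attempt
    feedback: str   # step-level diagnosis handed to the next retry


@dataclass
class SkillLibrary:
    skills: Dict[str, str] = field(default_factory=dict)
    frozen: bool = False

    def add(self, name: str, code: str) -> None:
        if self.frozen:
            raise RuntimeError("library is frozen after play")
        self.skills[name] = code


def play(
    propose: Callable[[SkillLibrary], str],
    solve: Callable[[str, Optional[str]], Attempt],
    library: SkillLibrary,
    iterations: int,
    max_retries: int = 3,  # assumed budget, not taken from the paper
) -> SkillLibrary:
    """Run self-directed play, then freeze the library for test time."""
    for _ in range(iterations):
        task = propose(library)           # agent picks its own task
        feedback: Optional[str] = None
        for _ in range(max_retries):
            attempt = solve(task, feedback)
            if attempt.success:
                library.add(task, attempt.code)  # distill the success
                break
            feedback = attempt.feedback   # diagnose, then retry
    library.frozen = True
    return library


if __name__ == "__main__":
    # Stub proposer and solver, purely for demonstration.
    def propose(lib: SkillLibrary) -> str:
        return f"task_{len(lib.skills)}"

    def solve(task: str, feedback: Optional[str]) -> Attempt:
        ok = feedback is not None  # succeed only once feedback exists
        return Attempt(f"def {task}(): pass", ok, "grasp slipped")

    print(sorted(play(propose, solve, SkillLibrary(), iterations=3).skills))
```

Freezing the library at the end of `play` mirrors the claim that test-time agents only read from it; a task that never succeeds within the retry budget simply contributes no skill.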

What carries the argument

RATs (Robotics Agent Teams) that generate exploratory tasks during play, execute code policies, verify outcomes, and compile successes into a reusable code skill library for later retrieval.

Load-bearing premise

The agents can reliably propose novel yet learnable exploratory tasks, verify intermediate progress, diagnose failures, and distill successful executions into a persistent reusable code skill library without external supervision or post-hoc human curation.

What would settle it

An experiment in which skills retrieved from the play-built library produce no improvement or lower success rates on held-out downstream tasks compared to no-play or random-play baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.19419 by Angjoo Kanazawa, Dantong Niu, David M. Chan, Haiwen Feng, Hanjun Yoo, Ion Stoica, Jiahui Lei, Jiaxin Ge, Junyi Zhang, Justin Yu, Ken Goldberg, Letian Fu, Raj Saravanan, Roei Herzig, Shaofeng Yin, Trevor Darrell, Yaowei Liu, Yutong Bai, Zihan Yang, Zirui Wang.

**Figure 1.** RATS enables Playful Agentic Robot Learning. Prior to receiving extrinsic reward signals, play-based Robotics Agent Teams autonomously propose intrinsic goals, practice them through Code-as-Policy execution, and distill successful behaviors into a reusable code skill library. At test time, the learned skills are retrieved and reused to solve future tasks.

**Figure 2.** RATS at play. RATS proposes self-directed play tasks, solves them with a Code-as-Policy agent team, and uses verification and diagnosis feedback to retry failed attempts. Successful behaviors are distilled into a reusable skill library, which is later retrieved at test time for target tasks.

Adjacent body text (truncated in the source): … the learnability sweet spot (r̄ ≈ 0.5), encouraging semi-familiar tasks over trivial (r̄ → 1) or impossible (r̄ → 0…
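
The text around Figure 2 mentions a learnability sweet spot at r̄ ≈ 0.5, with trivial tasks at r̄ → 1 and impossible ones at r̄ → 0. The exact scoring rule is not visible in this excerpt, so the sketch below is only one possible reading: it assumes r̄ is an estimated success rate in [0, 1] and uses a tent function peaked at 0.5. The names `learnability` and `pick_task` are hypothetical.

```python
from typing import Dict


def learnability(success_rate: float) -> float:
    """Score in [0, 1] that peaks at 0.5 and vanishes at 0 and 1.

    The tent shape is an assumption made for illustration; the paper
    may use a different function of the estimated success rate.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    return 1.0 - abs(2.0 * success_rate - 1.0)


def pick_task(estimates: Dict[str, float]) -> str:
    """Choose the candidate whose estimated success rate is most learnable."""
    return max(estimates, key=lambda name: learnability(estimates[name]))


if __name__ == "__main__":
    # Placeholder estimates, not values from the paper.
    candidates = {"trivial": 0.95, "impossible": 0.02, "semi_familiar": 0.55}
    print(pick_task(candidates))
```

Any function that is maximal near 0.5 and small near both ends would express the same preference for semi-familiar tasks.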

**Figure 3.** Qualitative comparisons in simulation. Panels: Swap Cubes, Close Drawer, Open Drawer, Place Cube in Bowl.

**Figure 4.** Qualitative results of sim-to-real transfer.

Adjacent body text (truncated in the source): Methods. We compare against VLA policies (OpenVLA [61], π0 [5], and π0.5 [6]) and against CaP-Agent0 [7], the Code-as-Policy agent run with only the primitive library L0 (the “No Play” condition). For RATS, we play on LIBERO-PRO and MolmoSpaces environments for 50 iterations each with gemini-3.1pro-preview. See Appendix A for more details. Metric and Evaluation M…

**Figure 5.** Example trace of play-time task proposal. An example of the MolmoSpaces play-time task proposal process at iteration 15.

**Figure 6.** Distribution of play objectives. Counts are computed from the 49 saved proposal records available for the 50-iteration run.

Adjacent body text (Appendix C.1, Play-Time Objectives and Knowledge Accumulation): We further analyze the objectives proposed during play and the knowledge accumulated from solving them. Below we analyze a 50-iteration play run in MolmoSpaces.

**Figure 7.** Skill library and failure memory growth during MolmoSpaces play. Reports learned skills, verified/experimental/deprecated skill counts, raw failure episodes, and distilled lessons.

Adjacent body text (truncated in the source): Learning from unsuccessful play. Failed trajectories also contribute useful supervision. For example, failing to open a sidetable drawer at iteration 2 produces experimental helpers for axis-aligned pull-direction estimation an…

**Figure 8.** Proportion of learned skill calls during MolmoSpaces evaluation. Each cell reports the runtime invocation count and the share of learned-skill calls within that task type. Each column aggregates 100 evaluation trials.

Adjacent body text (truncated in the source): … components are composed into a single policy step that localizes the cabinet handle, plans the grasp, grasps the handle, and pulls the cabinet open. These helpers were generated by Skill Prop…

**Figure 9.** Play-to-evaluation transfer lineage for a successful MolmoSpaces evaluation trial. Play-time tasks lead to learned skills stored in the frozen skill library and then to their compositional use in evaluation.

Panel labels from the adjacent comparison figure: CaP-Agent0 Direct Synthesis and RATS with learned skills, each shown across Approach, Handle interaction, and Final state, with a "× Failed" marker and a truncated code excerpt beginning "# Simplified version of getting grasps".

**Figure 10.** Direct code synthesis versus synthesis with learned skills. The condensed source-backed excerpts explain why learned skills reduce brittle low-level reasoning.

**Figure 11.** More qualitative comparisons between direct code synthesis and RATS with learned skills. Across shoe picking, table closing, and pick-and-place tasks, the CaP-Agent0 direct-synthesis policies fail while the RATS policies succeed. CaP-Agent0 recomputes perception, geometry, pos/quat, and waypoint logic inline; RATS calls learned skills for verified localization, grasping, closing, transport, and release.

**Figure 12.** LIBERO-to-RoboSuite skill transfer in two-arm lifting. For the same RoboSuite two_arm_lift evaluation seed, direct code synthesis fails while RATS succeeds by reusing skills selected from a LIBERO-derived library.

Original abstract

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.
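
The abstract says learned skills can be plugged into other Code-as-Policy agents "by simply retrieving them into the context". The retrieval mechanism is not specified in this excerpt, so the sketch below substitutes a plain token-overlap ranking as a stand-in. `tokenize`, `retrieve`, `build_context`, and the toy library entries are all hypothetical.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split on whitespace, treating underscores as spaces."""
    return set(text.lower().replace("_", " ").split())


def retrieve(task: str, library: Dict[str, str], k: int = 2) -> List[str]:
    """Return up to k skill names that share at least one token with the task."""
    task_tokens = tokenize(task)
    overlap = {name: len(task_tokens & tokenize(name)) for name in library}
    # Highest overlap first; ties broken alphabetically for determinism.
    ranked = sorted(library, key=lambda name: (-overlap[name], name))
    return [name for name in ranked if overlap[name] > 0][:k]


def build_context(task: str, library: Dict[str, str], k: int = 2) -> str:
    """Place the retrieved skill source code ahead of the task instruction."""
    blocks = [library[name] for name in retrieve(task, library, k)]
    return "\n\n".join(blocks + [f"# Task: {task}"])


if __name__ == "__main__":
    # Toy frozen library; skill bodies are placeholders.
    frozen_library = {
        "open_drawer": "def open_drawer(): ...",
        "close_drawer": "def close_drawer(): ...",
        "pick_cube": "def pick_cube(): ...",
    }
    print(build_context("open the top drawer", frozen_library))
```

Because the library is only read, the same `build_context` call could in principle sit in front of any inference-time agent, which is the property the abstract attributes to the approach.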

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RATs show gains from letting robots run unsupervised play to build a reusable code skill library before tasks arrive, but the reliability of that play pipeline is the part that needs checking.

The main point is that this paper adds a pre-task play stage where embodied coding agents propose their own exploratory tasks, plan and run code policies, verify progress step by step, diagnose failures with feedback, and distill wins into a frozen library. At test time the library gets retrieved to help on new tasks, and the skills can be dropped into other agents without retraining.

It reports clear numbers: 20.6 points better than CaP-Agent0 on LIBERO-PRO and 17 points on MolmoSpaces, plus 8.9 and 8.8 point lifts on RoboSuite and real-world transfer when the skills are just plugged in. The transfer-without-finetuning result is the most practical piece.

The soft spot is exactly the unsupervised play loop. The gains depend on the agents reliably proposing learnable tasks, verifying without external help, diagnosing failures densely, and producing skills that are actually reusable. If any of those steps needs hidden curation or oracle signals, the no-play and random-play baselines stop being fair comparisons. The abstract gives no error bars, controls, or exclusion rules, so the full methods have to show the pipeline runs end-to-end without intervention.

This is for people working on Code-as-Policy and embodied agents who want to explore continual skill acquisition. A reader already following that line of work will see the empirical angle quickly.

Send it to peer review. The claims are specific enough that referees can test whether the play phase actually delivers what it says.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Playful Agentic Robot Learning, in which Robotics Agent Teams (RATs) perform self-directed play to autonomously propose exploratory tasks, execute Code-as-Policy programs, verify progress, diagnose failures with dense feedback, and distill successes into a reusable frozen code skill library. These skills are then retrieved at test time to improve held-out downstream task performance, yielding 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively, plus 8.9 and 8.8 point improvements when plugged into other agents on RoboSuite and real-world transfer without finetuning the base model.

Significance. If the unsupervised play pipeline functions as described, the work offers a concrete path toward modular, reusable skill libraries in embodied agents that can be composed at inference time. The reported benchmark gains on named suites and the zero-finetune transfer results provide a clear, falsifiable basis for evaluating the approach. The absence of free parameters in the core claim and the emphasis on persistent code skills are strengths that distinguish it from purely task-driven baselines.

major comments (2)

[abstract and §3] Play phase description (abstract and §3): The central empirical gains rest on the claim that RATs autonomously propose novel yet learnable tasks, verify intermediate progress, diagnose failures, and distill skills without external supervision or post-hoc curation. No concrete mechanisms, success criteria, or failure modes for autonomous verification and diagnosis are provided; if these steps rely on implicit human filtering or oracle signals, the no-play and random-play baselines become incomparable and the 20.6 pp / 17.0 pp improvements do not follow from the stated method.
[results section and tables] Experimental controls (results section and tables reporting LIBERO-PRO / MolmoSpaces): The abstract and results report percentage-point gains but supply no information on number of runs, statistical significance tests, data exclusion rules, or error bars. Without these, it is impossible to assess whether the observed differences are robust or whether the play-learned library genuinely drives the improvement versus variance in the underlying Code-as-Policy agent.

minor comments (2)

[abstract] Abstract: the phrase 'without finetuning the underlying model' should be accompanied by a brief statement of what model is used and whether any prompt engineering or retrieval hyperparameters are tuned.
Notation: 'RATs' is introduced as an acronym but the expansion 'Robotics Agent Teams' appears only once; consistent use of the acronym after first definition would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen clarity and reporting where needed.

Referee: [abstract and §3] Play phase description (abstract and §3): The central empirical gains rest on the claim that RATs autonomously propose novel yet learnable tasks, verify intermediate progress, diagnose failures, and distill skills without external supervision or post-hoc curation. No concrete mechanisms, success criteria, or failure modes for autonomous verification and diagnosis are provided; if these steps rely on implicit human filtering or oracle signals, the no-play and random-play baselines become incomparable and the 20.6 pp / 17.0 pp improvements do not follow from the stated method.

Authors: Section 3 specifies that verification and diagnosis occur via the agent's internal code execution traces and self-generated dense feedback prompts, with no external oracle or human curation applied to the play phase. The no-play and random-play baselines use identical execution and retrieval pipelines, isolating the effect of the distilled library. To address the request for greater concreteness, we will add explicit pseudocode for the verification loop, success criteria (e.g., task-completion predicates), and representative failure-mode traces in the revised §3. revision: yes
Referee: [results section and tables] Experimental controls (results section and tables reporting LIBERO-PRO / MolmoSpaces): The abstract and results report percentage-point gains but supply no information on number of runs, statistical significance tests, data exclusion rules, or error bars. Without these, it is impossible to assess whether the observed differences are robust or whether the play-learned library genuinely drives the improvement versus variance in the underlying Code-as-Policy agent.

Authors: The referee correctly notes the absence of run counts, error bars, and significance tests. In the revision we will report results over 5 independent seeds per condition, include standard-error bars on all tables, state the data-exclusion rule (failed environment resets only), and add paired t-tests between conditions to quantify robustness. revision: yes
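
The reporting promised in this response (standard errors and paired t-tests over 5 seeds) can be sketched with the Python standard library alone. The function names are hypothetical and the seed-level counts in the demo are made-up placeholders, not results from the paper. The critical value 2.776 is the two-sided 5% threshold for a t distribution with 4 degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def standard_error(values: Sequence[float]) -> float:
    """Sample standard deviation divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    se = standard_error(diffs)
    if se == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs), mean(diffs) / se, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder successes per 100 trials for 5 seeds; NOT from the paper.
    with_play = [62, 58, 65, 60, 63]
    no_play = [45, 44, 47, 41, 46]
    diff, t_stat, dof = paired_t(with_play, no_play)
    T_CRIT_DF4 = 2.776  # two-sided alpha = 0.05, 4 degrees of freedom
    print(f"mean diff={diff}, t={t_stat:.2f}, df={dof}, "
          f"significant={abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed matters here because both conditions share the same environment resets, so seed-to-seed variance cancels in the differences.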

Circularity Check

0 steps flagged

No circularity: empirical gains rest on experimental comparisons, not derivations

The paper reports empirical performance improvements from play-learned skills on held-out downstream tasks (LIBERO-PRO, MolmoSpaces, RoboSuite, real-world) versus no-play and random-play baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims are direct experimental outcomes rather than any reduction of a result to its own inputs by construction. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level system name; the method appears to rely on standard assumptions of simulation environments and code execution feedback.

invented entities (1)

RATs (Robotics Agent Teams): no independent evidence
Purpose: embodied coding agents designed for play-time skill acquisition via self-directed exploration and code distillation.
Introduced as the core new system in the abstract; no independent evidence provided outside the paper.

pith-pipeline@v0.9.1-grok · 5834 in / 1256 out tokens · 33888 ms · 2026-06-26T20:52:22.988305+00:00 · methodology

Reference graph

Works this paper leans on

121 extracted references · 8 canonical work pages

[1] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.

[2] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. ProgPrompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530. IEEE, 2023.

[3] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor. ChatGPT for robotics: Design principles and model abilities. IEEE Access, 12:55682–55696, 2024.

[4] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V. Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. Sanketi, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y. Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y. Kuang, D. Kalashnikov, R. Jul... (2023)

[5] K. Black, N. Brown, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, L. Smith, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. In Procee... (2025)

[6] Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tan... (2025)

[7] M. Fu, J. Yu, K. El-Refai, E. Kou, H. Xue, H. Huang, W. Xiao, G. Wang, F.-F. Li, G. Shi, J. Wu, S. Sastry, Y. Zhu, K. Goldberg, and L. Fan. CaP-X: A framework for benchmarking and improving coding agents for robot manipulation. arXiv preprint arXiv:2603.22435, 2026.

[8] J. Piaget. The Origins of Intelligence in Children. International Universities Press, New York, 1952.

[9] A. Gopnik. Childhood as a solution to explore-exploit tensions. Philosophical Transactions of the Royal Society B: Biological Sciences, 375(1803):20190502, 2020. doi:10.1098/rstb.2019.0502

[10] L. B. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005. doi:10.1162/1064546053278973
[11] J. Schmidhuber. Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume 2, pages 1458–1463, Singapore, 1991. IEEE.

[12] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007. doi:10.1109/TEVC.2006.890271

[13] A. Baranes and P.-Y. Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013.

[14] S. Forestier and P.-Y. Oudeyer. Modular active curiosity-driven discovery of tool use. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3965–3972, Daejeon, Korea, 2016. IEEE.

[15] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 2778–2787, 2017.

[16] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29 (NeurIPS), 2016.

[17] L. S. Vygotsky. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge, MA, 1978. Edited by Michael Cole, Vera John-Steiner, Sylvia Scribner, and Ellen Souberman.

[18] A. D. Pellegrini. The Role of Play in Human Development. Oxford University Press, Oxford, UK, 2009. ISBN 9780195367324.

[19] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

[20] F. Kaplan and P.-Y. Oudeyer. In search of the neural circuits of intrinsic motivation. Frontiers in Neuroscience, 1(1):225–236, 2007. doi:10.3389/neuro.01.1.1.017.2007
[21] C. Kidd, S. T. Piantadosi, and R. N. Aslin. The goldilocks effect: Human infants allocate attention to visual sequences that are neither too simple nor too complex. PLOS ONE, 7(5):e36399, 2012. doi:10.1371/journal.pone.0036399

[22] M. Rolf, J. J. Steil, and M. Gienger. Goal babbling permits direct learning of inverse kinematics. IEEE Transactions on Autonomous Mental Development, 2(3):216–229, 2010. doi:10.1109/TAMD.2010.2062511

[23] S. Forestier, R. Portelas, Y. Mollard, and P.-Y. Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. Journal of Machine Learning Research, 23(152):1–41, 2022.

[24] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In L. P. Kaelbling, D. Kragic, and K. Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 1113–1132. PMLR, 30 Oct–01 Nov 2020. URL https://proceedings.mlr.press/v100/...

[25] C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. In J. Tan, M. Toussaint, and K. Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 201–221. PMLR, 06–09 Nov 2023. URL https:/...

[26] C. Colas, T. Karch, N. Lair, J.-M. Dussoux, C. Moulin-Frier, P. F. Dominey, and P.-Y. Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.

[27] S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1–2):25–46, 1995.

[28] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019. doi:10.1016/j.neunet.2019.01.012

[29] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. Díaz-Rodríguez. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52–68, 2020.

[30] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
[31] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360–375, 2012.

[32] O. Kroemer, S. Niekum, and G. Konidaris. A review of robot learning for manipulation: Challenges, representations, and algorithms. Journal of Machine Learning Research, 22(30):1–82, 2021.

[33] K. Pertsch, Y. Lee, and J. J. Lim. Accelerating reinforcement learning with learned skill priors. In Proceedings of the 4th Conference on Robot Learning (CoRL), volume 155 of Proceedings of Machine Learning Research, pages 188–204, 2020.

[34] C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. In Proceedings of Robotics: Science and Systems (RSS), 2021. doi:10.15607/RSS.2021.XVII.047

[35] W. Wan, Y. Zhu, R. Shah, and Y. Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery, 2024. URL https://arxiv.org/abs/2311.02058

[36] Y. J. Ma, W. Liang, H.-J. Wang, S. Wang, Y. Zhu, L. Fan, O. Bastani, and D. Jayaraman. DrEureka: Language model guided sim-to-real transfer. In Robotics: Science and Systems (RSS), 2024. URL https://arxiv.org/abs/2406.01967

[37] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 41–48. ACM, 2009.

[38] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181):1–50, 2020.

[39] C. Florensa, D. Held, X. Geng, and P. Abbeel. Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 1515–1528, 2018.

[40] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.
[41] V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-Fit: State-covering self-supervised reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 7783–7792, 2020.

[42] D. Seita, D. Chan, R. Rao, C. Tang, M. Zhao, and J. Canny. ZPD teaching strategies for deep reinforcement learning from demonstrations. In Deep Reinforcement Learning Workshop at NeurIPS, 2019.

[43] Y. Mu, J. Chen, Q. Zhang, S. Chen, Q. Yu, C. Ge, R. Chen, Z. Liang, M. Hu, C. Tao, P. Sun, H. Yu, C. Yang, W. Shao, W. Wang, J. Dai, Y. Qiao, M. Ding, and P. Luo. Robocodex: Multimodal code generation for robotic behavior synthesis, 2024. URL https://arxiv.org/abs/2402.16117

[44] G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025.

[45] J. Shi, R. Yang, K. Chao, B. S. Wan, Y. S. Shao, J. Lei, J. Qian, L. Le, P. Chaudhari, K. Daniilidis, et al. Maestro: Orchestrating robotics modules with vision-language models for zero-shot generalist robots. In NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI, 2025.

[46] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettingh... (2022)

[47] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conf... (2023)

[48] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. In Proceedings of the 7th Conference on Robot Learning (CoRL), volume 229 of Proceedings of Machine Learning Research, 2023.

[49] A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, and P. Florence. Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

[50] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin, B. Wang, J. Dai, Y. Qiao, and P. Luo. EmbodiedGPT: Vision-language pre-training via embodied chain of thought. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 25081–25094, 2023.
[51]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Represen- tations, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

2023
[52]

Shinn, F

N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Lan- guage agents with verbal reinforcement learning. InAdvances in Neural Information Process- ing Systems 36 (NeurIPS), 2023

2023
[53]

Madaan, N

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prab- humoye, Y . Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Informa- tion Processing Systems 36 (NeurIPS), 2023

2023
[54]

X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations (ICLR), 2024

2024
[55]

G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research (TMLR), 2024.
[56]

K. Lin, C. Snell, Y. Wang, C. Packer, S. Wooders, I. Stoica, and J. E. Gonzalez. Sleep-time compute: Beyond inference scaling at test-time. arXiv preprint arXiv:2504.13171, 2025.
[57]

Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, K. Lin, S. Nasiriany, and Y. Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.
[58]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
[59]

X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025.
[60]

Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, et al. Molmospaces: A large-scale open ecosystem for robot navigation and manipulation. arXiv preprint arXiv:2602.11337, 2026.
[61]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, 2024.
[62]

Review the task history — what worked, what failed, and WHY?
[63]

If recent tasks failed, consider whether to try a simpler variant or a completely different approach
[64]

If recent tasks succeeded, consider building on those skills with something slightly harder, but still single-step and child-feasible
[65]

Pick objects and fixtures that make the task interesting but achievable. The target object (arg1 of an `On`/`In` goal) MUST come from the RELIABLE list in the Pick-Primitive Reliability block
[66]

Ensure the goal uses valid predicates from the catalog
[67]

Do NOT use the "Stack" predicate (currently unsupported in the simulator)
[68]

For kitchen scenes use LIBERO_Kitchen_Tabletop_Manipulation, for table scenes use LIBERO_Tabletop_Manipulation
[69]

**Inspect against the KNOWN ENV LIMITATIONS list above.** If your intended (predicate, container) combo is on it, pick a different container OR a different predicate. State in `reasoning` that you checked the list and explain the substitution, or note "no broken combos apply" if the list is empty for your candidate. Respond in JSON with keys:...
[70]

The problem class name MUST be one of the listed scene types (e.g., LIBERO_Kitchen_Tabletop_Manipulation).
[71]

Every object and fixture in (:init) must have a placement region defined in (:regions)
[72]

Region ranges are (x_min y_min x_max y_max) relative to the workspace. Keep ranges small (0.02 wide). Stay within [-0.45, 0.45] for x and [-0.35, 0.35] for y
[73]

Fixtures that go on the workspace also need an init region and an (On fixture workspace_region) in (:init)
[74]

Fixture sub-regions (like top_region for cabinet, cook_region for stove, heating_region for microwave) use (:target fixture_instance) with NO (:ranges); they are predefined by the fixture
[75]

Objects of interest ((:obj_of_interest)) should include the objects/regions that appear in the goal
[76]

Use And to combine multiple goal predicates
[77]

Object instances are numbered: butter_1, akita_black_bowl_1, akita_black_bowl_2, etc
[78]

CRITICAL: In (:regions), region names must NOT include the workspace prefix. LIBERO auto-prepends "{workspace}_" to region names. So define "butter_init_region" (NOT "kitchen_table_butter_init_region"). In (:init) and (:goal), use the FULL prefixed name: "kitchen_table_butter_init_region"
[79]

In (:goal), fixture sub-regions are prefixed: {fixture_instance}_{sub} (e.g., wooden_cabinet_1_top_region)
[80]

CRITICAL: In (:fixtures), the workspace type is the LOWERCASE workspace type from the catalog (e.g., "table", "kitchen_table"), NOT the problem class name. Example: "main_table - table" or "kitchen_table - kitchen_table"
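
The region rules quoted in the prompt lines above (un-prefixed names in (:regions), full prefixed names in (:init) and (:goal), small ranges inside the workspace bounds) lend themselves to a mechanical check. The following is a minimal Python sketch under stated assumptions, not the authors' implementation and not part of the LIBERO API: `Region`, `full_region_name`, and `check_region` are hypothetical names, and reading "0.02 wide" as a per-axis upper bound is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Bounds quoted in the prompt rules; all names below are hypothetical.
X_BOUNDS = (-0.45, 0.45)
Y_BOUNDS = (-0.35, 0.35)
MAX_WIDTH = 0.02   # assumption: "0.02 wide" is a per-axis upper bound
EPS = 1e-9         # tolerance for floating-point comparisons


@dataclass(frozen=True)
class Region:
    name: str      # name as written in (:regions), without the workspace prefix
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def full_region_name(workspace: str, region: Region) -> str:
    """Name to use in (:init) and (:goal), following the '{workspace}_' rule."""
    return f"{workspace}_{region.name}"


def check_region(workspace: str, region: Region) -> List[str]:
    """Return human-readable rule violations (empty list if none)."""
    problems = []
    if region.name.startswith(f"{workspace}_"):
        problems.append("region name must not include the workspace prefix")
    if not (region.x_min < region.x_max and region.y_min < region.y_max):
        problems.append("ranges must be ordered as (x_min y_min x_max y_max)")
    if region.x_min < X_BOUNDS[0] - EPS or region.x_max > X_BOUNDS[1] + EPS:
        problems.append("x range leaves [-0.45, 0.45]")
    if region.y_min < Y_BOUNDS[0] - EPS or region.y_max > Y_BOUNDS[1] + EPS:
        problems.append("y range leaves [-0.35, 0.35]")
    if (region.x_max - region.x_min > MAX_WIDTH + EPS
            or region.y_max - region.y_min > MAX_WIDTH + EPS):
        problems.append("range is wider than 0.02")
    return problems


good = Region("butter_init_region", -0.01, -0.01, 0.01, 0.01)
bad = Region("kitchen_table_butter_init_region", 0.40, 0.30, 0.50, 0.40)
print(full_region_name("kitchen_table", good))
print(check_region("kitchen_table", good))
print(check_region("kitchen_table", bad))
```

Here `good` passes every check, while `bad` repeats the workspace prefix, leaves both axis bounds, and is wider than 0.02 on each axis. A check of this kind could run before a proposed task file is handed to the simulator, though where the paper's pipeline performs such validation is not stated in the text above.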
