In-Context World Modeling for Robotic Control

Jingjing Gong; Junhao Shi; Li Ji; Senyu Fei; Siyin Wang; Xipeng Qiu; Zhaoyang Fu

arXiv: 2606.26025 · v2 · pith:ZX3VUJRMnew · submitted 2026-06-24 · cs.RO · cs.CV

Siyin Wang, Junhao Shi, Senyu Fei, Zhaoyang Fu, Li Ji, Jingjing Gong, Xipeng Qiu

Pith reviewed 2026-06-26 05:09 UTC · model grok-4.3

classification: cs.RO, cs.CV

keywords: in-context learning, vision-language-action models, robot control, system identification, generalization, adaptation without fine-tuning, world modeling


The pith

In-Context World Modeling lets robot policies infer system variables from short task-agnostic interaction histories to adapt to novel configurations without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that Vision-Language-Action models often fail on novel camera viewpoints or robot morphologies because they condition only on current observations and instructions, implicitly assuming the fixed contexts seen in training. ICWM reframes system identification as in-context adaptation: the policy first processes a brief history of its own task-agnostic interactions to capture the current world dynamics, then executes the task. This inference happens inside the existing context window and requires no fine-tuning or parameter updates. A sympathetic reader would care because, if the claim holds up, it removes the need for data-intensive retraining when deployment conditions change.

Core claim

ICWM treats system identification as an in-context adaptation problem. By processing a short history of self-generated, task-agnostic interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates.
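The probe-then-act loop in this claim can be illustrated with a toy sketch. It is not the authors' implementation: the paper describes implicit inference inside a VLA's context window, whereas the sketch swaps in an explicit closed-form estimate so that it stays small and runnable. All names (`ToyWorld`, `collect_probe_history`, `infer_camera_angle`, `frozen_policy`, `run_episode`) are hypothetical, and the hidden system variable is assumed to be a single planar camera rotation.

```python
import math
import random


def rotate(vec, theta):
    """Rotate a 2D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


class ToyWorld:
    """Point robot observed through a camera rotated by a hidden angle (assumption)."""

    def __init__(self, camera_angle):
        self._angle = camera_angle  # hidden system variable
        self._pos = (0.0, 0.0)      # position in the robot frame

    def observe(self):
        return rotate(self._pos, self._angle)

    def step(self, action):
        self._pos = (self._pos[0] + action[0], self._pos[1] + action[1])
        return self.observe()


def collect_probe_history(world, num_steps, rng):
    """Self-generated, task-agnostic interactions: random moves, no goal involved."""
    history = []
    for _ in range(num_steps):
        before = world.observe()
        action = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        after = world.step(action)
        history.append((before, action, after))
    return history


def infer_camera_angle(history):
    """Explicit stand-in for the implicit in-context inference described in the paper.

    Each observed displacement d equals R(theta) applied to the action a, so
    dot(a, d) is proportional to cos(theta) and cross(a, d) to sin(theta).
    """
    cos_sum = sin_sum = 0.0
    for before, action, after in history:
        dx, dy = after[0] - before[0], after[1] - before[1]
        cos_sum += action[0] * dx + action[1] * dy
        sin_sum += action[0] * dy - action[1] * dx
    return math.atan2(sin_sum, cos_sum)


def frozen_policy(observation, goal_in_image, history):
    """Same code for every setup; only the context (history) changes."""
    theta = infer_camera_angle(history) if history else 0.0
    error = (goal_in_image[0] - observation[0], goal_in_image[1] - observation[1])
    return rotate(error, -theta)  # map the image-frame error into the robot frame


def run_episode(camera_angle, use_history, seed=0):
    """Return the image-frame distance to the goal after one policy step."""
    rng = random.Random(seed)
    world = ToyWorld(camera_angle)
    history = collect_probe_history(world, 5, rng) if use_history else []
    goal = (1.0, 0.5)
    observation = world.observe()
    observation = world.step(frozen_policy(observation, goal, history))
    return math.dist(observation, goal)
```

In this toy setting, `run_episode(math.radians(60), use_history=True)` lands on the goal up to floating-point error, while the same call with `use_history=False` misses it. The policy code is identical in both cases; only the context differs, which is the property the claim relies on.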

What carries the argument

In-Context World Modeling (ICWM), which repurposes the model's context window to model system dynamics from interaction histories instead of using it only for task specification via demonstrations.
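The contrast between the two uses of the context window can be sketched as a difference in what gets prepended to the current observation. The layout below is an assumption made for illustration only; the segment roles, their ordering, and the names `Segment`, `build_icl_context`, and `build_icwm_context` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    role: str     # hypothetical tag: "demo", "probe", "instruction", or "observation"
    content: str  # placeholder standing in for image and action tokens


def build_icl_context(demonstrations: Sequence[str],
                      instruction: str,
                      observation: str) -> List[Segment]:
    """Conventional in-context learning: demonstrations specify WHAT task to do."""
    context = [Segment("demo", demo) for demo in demonstrations]
    context.append(Segment("instruction", instruction))
    context.append(Segment("observation", observation))
    return context


def build_icwm_context(probe_interactions: Sequence[str],
                       instruction: str,
                       observation: str) -> List[Segment]:
    """ICWM-style context: task-agnostic interactions show HOW the system behaves."""
    context = [Segment("probe", probe) for probe in probe_interactions]
    context.append(Segment("instruction", instruction))
    context.append(Segment("observation", observation))
    return context
```

Under this assumed layout, the two contexts have the same shape and differ only in the prefix: task demonstrations in one case, self-generated interactions that carry no task information in the other.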

If this is right

Policies generalize to altered camera viewpoints without fine-tuning or additional data collection.
The same model can adapt to changes in robot morphologies using only the interaction history.
Outperformance over standard VLA baselines holds in both simulation and real-world robot platforms.
System identification occurs autonomously from self-generated data before any task is attempted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could reduce the cost of collecting task-specific demonstration datasets by shifting adaptation work into the context window.
Similar context-based inference might apply to other control settings where the underlying plant parameters vary at deployment time.
If the interaction history can be made even shorter while remaining informative, the approach would become practical for very low-latency robot systems.

Load-bearing premise

A short history of task-agnostic interactions is sufficient to capture the full world dynamics needed for reliable adaptation across arbitrary novel configurations such as altered camera viewpoints or robot morphologies.

What would settle it

An experiment in which providing the short interaction history produces no measurable improvement in success rate on novel camera viewpoints compared with the baseline VLA model that receives no history.
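One way to make "no measurable improvement" concrete is a two-proportion z-test on success counts with and without the interaction history. The sketch below is a generic statistical check, not an analysis from the paper; the function name `two_proportion_z_test` and the choice of a pooled two-sided test are assumptions.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value).

    Condition A could be the policy with interaction history and condition B
    the baseline without it (hypothetical labels).
    """
    if trials_a <= 0 or trials_b <= 0:
        raise ValueError("each condition needs at least one trial")
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if std_err == 0.0:
        return 0.0, 1.0  # identical all-success or all-failure outcomes
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value
```

With purely hypothetical counts of 40/50 successes with history and 20/50 without, the statistic is about 4.08, well above the conventional 1.96 threshold; equal counts in both conditions give z = 0, which is the outcome that would count against the claim.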

Original abstract

Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume a fixed execution context encountered during training, necessitating data-intensive fine-tuning for any new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ICWM frames system ID as in-context adaptation from task-agnostic interactions, but the experiments only cover viewpoint changes despite claims about arbitrary configs like morphologies.


The main point is that this paper treats system identification as an in-context problem: the robot runs a short sequence of its own random moves, the model reads that history to infer the current setup, and then the policy runs the task without any parameter updates. That is a distinct angle from standard in-context learning, which usually just specifies the task rather than the underlying dynamics.

It does a clean job of naming the real limitation in current VLAs: they condition only on current observations and instructions, so they bake in the training environment and need heavy fine-tuning for anything new. The proposed fix is practical in intent.

The soft spots are in the evidence and scope. The abstract states significant outperformance on novel viewpoints in simulation and on real robots, yet supplies no numbers, baselines, error bars, or dataset details. More critically, the framing extends to arbitrary novel configurations including robot morphologies, but the reported experiments address only camera viewpoints. A short task-agnostic history may not reveal kinematic or visual differences that come with morphology changes, so the central assumption is not yet tested at the breadth claimed.

This is for people working on VLA generalization and embodied deployment. A reader focused on reducing environment-specific retraining would find the idea worth examining. It deserves peer review so the full methods and results can be checked against the broader claims.

Referee Report

2 major / 1 minor

Summary. The paper introduces In-Context World Modeling (ICWM) for Vision-Language-Action (VLA) models. ICWM treats system identification as an in-context adaptation task, allowing policies to infer essential system variables from a short history of self-generated, task-agnostic interactions. This enables adaptation to novel configurations (e.g., camera viewpoints or morphologies) without parameter updates, in contrast to standard VLA models that assume fixed training contexts. The abstract reports that extensive experiments in simulation and on real robots show ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.

Significance. If the empirical results are supported by proper metrics, baselines, and controls, the framework could reduce reliance on data-intensive fine-tuning for robotic policies in new environments by leveraging the context window for world dynamics inference. The approach builds on in-context learning ideas but applies them to system identification rather than task specification.

major comments (2)

[Abstract] The central claim states that ICWM enables adaptation to 'novel configurations' including altered robot morphologies, yet the reported experiments are described only for novel camera viewpoints; this scope mismatch is load-bearing because the abstract supplies no evidence that a short history of task-agnostic interactions suffices to identify kinematic or morphological differences.
[Abstract] The assertion of 'significant outperformance' on novel viewpoints supplies no metrics, baselines, controls, error bars, dataset details, or quantitative results, preventing evaluation of the soundness of the central empirical claim.

minor comments (1)

The distinction drawn between traditional in-context learning (task specification via demonstrations) and ICWM (system dynamics inference) would benefit from a concrete illustrative example or pseudocode in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. We address each major comment below.


Referee: [Abstract] The central claim states that ICWM enables adaptation to 'novel configurations' including altered robot morphologies, yet the reported experiments are described only for novel camera viewpoints; this scope mismatch is load-bearing because the abstract supplies no evidence that a short history of task-agnostic interactions suffices to identify kinematic or morphological differences.

Authors: The referee is correct that the abstract introduces adaptation to novel configurations (including morphologies) in the opening sentences but then reports empirical results exclusively for novel camera viewpoints. The manuscript does not provide experiments or evidence demonstrating identification of kinematic or morphological differences via short task-agnostic histories. We will revise the abstract to remove the overbroad claim and align the stated contributions precisely with the evaluated scope. revision: yes

Referee: [Abstract] The assertion of 'significant outperformance' on novel viewpoints supplies no metrics, baselines, controls, error bars, dataset details, or quantitative results, preventing evaluation of the soundness of the central empirical claim.

Authors: Abstracts are length-limited and conventionally summarize findings at a high level; the full quantitative results, including metrics, baselines, controls, error bars, and dataset details, appear in the experimental sections of the manuscript. We agree the abstract's phrasing is too vague on its own and will add a concise statement of the key performance gains (with reference to the detailed tables) to strengthen the summary. revision: partial

Circularity Check

0 steps flagged

No circularity detected; no derivations or equations present


The provided manuscript text, including the abstract and full-text placeholder, contains no equations, mathematical derivations, parameter-fitting procedures, or load-bearing self-citations. The ICWM framework is introduced conceptually as treating system identification as in-context adaptation, with claims about inferring variables from interaction history, but without any formal chain that reduces predictions to inputs by construction. All of the circularity patterns checked for (self-definitional, fitted-input-as-prediction, etc.) require explicit reductions via equations or citations, none of which appear. The argument therefore contains no circular step to flag.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.




[1] [1]

Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong T. Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalew...

2023

[2] [2]

Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov , Ethan Foster, Grace Lam, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. ArXiv, abs/2406.09246, 2024. ...

Pith/arXiv arXiv 2024

[3] [3]

π0: A vision-language-action flow model for general robot control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. corr, abs/2410.24164, 2024. doi: 10.48550. arXiv preprint ARXIV .2410.24164

Pith/arXiv arXiv 2024

[4] [4]

Senyu Fei, Siyin Wang, Junhao Shi, Z. G. Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models. ArXiv, abs/2510.13626, 2025. URL https://api.semanticscholar.org/CorpusID:282102298

Pith/arXiv arXiv 2025

[5] [5]

Goldberg

Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, Will Panitch, Fangchen Liu, Hui Li, and Ken- neth Y . Goldberg. Icrt: In-context imitation learning via next-token prediction. 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 5937–5944, 2024. URL https://api.semanticscholar.org/CorpusID: 271974730

2025

[6] [6]

Mimicdroid: In-context learning for humanoid robot manipulation from human play videos

Rutav Shah, Shuĳing Liu, Qi Wang, Zhenyu Jiang, Sateesh Kumar, Mingyo Seo, Roberto Mart’in-Mart’in, and Yuke Zhu. Mimicdroid: In-context learning for humanoid robot manipulation from human play videos. ArXiv, abs/2509.09769, 2025. URL https://api.semanticscholar.org/CorpusID:281309736

arXiv 2025

[7] [7]

Language models are unsu- pervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsu- pervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533

2019

[8] [8]

Llama: Open and efficient foundation language models

Hugo T ouvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Bap- tiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aur’elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. URL https://api.semanticscholar.org/Co...

Pith/arXiv arXiv 2023

[9] [9]

Introducing chatgpt, 2022

OpenAI. Introducing chatgpt, 2022. URL https://openai.com/blog/chatgpt

[10]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Li...

[11]

URL https://api.semanticscholar.org/CorpusID:218971783

[12]

A survey on in-context learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A survey on in-context learning. In Conference on Empirical Methods in Natural Language Processing, 2022. URL https://api.semanticscholar.org/CorpusID:255372865

[13]

Vuong Dinh An, Minh Nhat Vu, Dong An, and Ian D. Reid. Action tokenizer matters in in-context imitation learning. 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496, 2025. URL https://api.semanticscholar.org/CorpusID:276742267

[14]

Ricl: Adding in-context adaptability to pre-trained vision-language-action models

Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, and Insup Lee. Ricl: Adding in-context adaptability to pre-trained vision-language-action models. ArXiv, abs/2508.02062, 2025. URL https://api.semanticscholar.org/CorpusID:280422322

[15]

Vidhi Jain, Maria Attarian, Nikhil J. Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R. Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, and Debidatta Dwibedi. Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers. ArXiv, abs/2403.12943, 2024. URL https://api.semanticsch...

[16]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017

[17]

Meta-learning with implicit gradients

Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019

[18]

Rl 2: Fast reinforcement learning via slow reinforcement learning

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl 2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016

[19]

Varibad: A very good method for bayes-adaptive deep rl via meta-learning

Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. Varibad: A very good method for bayes-adaptive deep rl via meta-learning. arXiv preprint arXiv:1910.08348, 2019

[20]

Recurrent world models facilitate policy evolution

David R Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Neural Information Processing Systems, 2018. URL https://api.semanticscholar.org/CorpusID:52171619

[21]

A path towards autonomous machine intelligence version 0.9.2, 2022-06-27

Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. 2022. URL https://api.semanticscholar.org/CorpusID:251881108

[22]

Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, J. Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, and Yong Li. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58:1–38, 2024. URL https://api.semanticscholar.org/CorpusID:274192171

[23]

World modeling makes a better planner: Dual preference optimization for embodied task planning

Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, and Xipeng Qiu. World modeling makes a better planner: Dual preference optimization for embodied task planning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21518–21537, 2025

[24]

World action models: The next frontier in embodied ai

Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, et al. World action models: The next frontier in embodied ai. arXiv preprint arXiv:2605.12090, 2026

[25]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chi-Hou Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. ArXiv, abs/2312.13139, 2023. URL https://api.semanticscholar.org/CorpusID:266374724

[26]

Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy

Peiyan Li, Hongtao Wu, Yan Huang, Chi-Hou Cheang, Liang Wang, and Tao Kong. Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy. IEEE Robotics and Automation Letters, 10:1912–1919, 2024. URL https://api.semanticscholar.org/CorpusID:271957548

[27]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pa...

[28]

Worldvla: Towards autoregressive action world model

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model. ArXiv, abs/2506.21539, 2025. URL https://api.semanticscholar.org/CorpusID:280010695

[30]

Danijar Hafner, Timothy P. Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. ArXiv, abs/1912.01603, 2019. URL https://api.semanticscholar.org/CorpusID:208547755

[31]

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. ArXiv, abs/2010.02193, 2020. URL https://api.semanticscholar.org/CorpusID:222133157

[32]

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and P. Abbeel. Daydreamer: World models for physical robot learning. In Conference on Robot Learning, 2022. URL https://api.semanticscholar.org/CorpusID:250088882

[33]

Danijar Hafner, J. Pašukonis, Jimmy Ba, and Timothy P. Lillicrap. Mastering diverse domains through world models. ArXiv, abs/2301.04104, 2023. URL https://api.semanticscholar.org/CorpusID:255569874

[34]

Flare: Robot learning with implicit world modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. Flare: Robot learning with implicit world modeling. ArXiv, abs/2505.15659, 2025. URL h...

[35]

Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. ArXiv, abs/2302.00111, 2023. URL https://api.semanticscholar.org/CorpusID:256459809

[36]

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xi Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke S. Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu...

[37]

Predictive inverse dynamics models are scalable learners for robotic manipulation

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. ArXiv, abs/2412.15109, 2024. URL https://api.semanticscholar.org/CorpusID:274859727

[38]

Video pretraining (vpt): Learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. ArXiv, abs/2206.11795, 2022. URL https://api.semanticscholar.org/CorpusID:249953673

[39]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. ArXiv, abs/2504.02792, 2025. URL https://api.semanticscholar.org/CorpusID:277510147

[40]

Unified video action model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. ArXiv, abs/2503.00200, 2025. URL https://api.semanticscholar.org/CorpusID:276741531

[41]

LIBERO: benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: benchmarking knowledge transfer for lifelong robot learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proce...

[42]

Nora: A small open-sourced generalist vision language action model for embodied tasks

Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. Nora: A small open-sourced generalist vision language action model for embodied tasks. ArXiv, abs/2504.19854, 2025. URL https://api.semanticscholar.org/CorpusID:278165428

[44]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: efficient action tokenization for vision-language-action models. CoRR, abs/2501.09747, 2025. doi: 10.48550/ARXIV.2501.09747. URL https://doi.org/10.48550/arXiv.2501.09747

[45]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

[46]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. Ar...

[47]

Put the toy on the box into the basket

Spatial Reasoning & Disambiguation: “Put the toy on the box into the basket.” This task requires the agent to understand the vertical spatial relationship between the toy and the box, necessitating precise end-effector positioning to take the toy without disturbing the support surface, a process highly sensitive to viewpoint-induced depth errors. Camer...

[48]

Stack the yellow cup onto the red cup

Fine-grained Alignment: “Stack the yellow cup onto the red cup.” This serves as a benchmark for high-precision motor control, where the agent must align the principal axes of two objects under novel perspective projections

[49]

Lift the basket

Structural Manipulation: “Lift the basket.” This task focuses on handle-centric grasping of large-scale empty containers, testing the model’s ability to ground actions on specific structural affordances of an object

[50]

Pick up the eggplant and place it onto the red plate

Multi-Object Semantic Grounding: “Pick up the eggplant and place it onto the red plate.” Conducted in a cluttered scene with multiple distractor objects, this task assesses the model’s ability to maintain correct object-instruction alignment when viewed from unfamiliar angles that may cause occlusion or visual overlap. For task-specific knowledge, we ...