arxiv: 2606.26025 · v2 · pith:ZX3VUJRMnew · submitted 2026-06-24 · 💻 cs.RO · cs.CV

In-Context World Modeling for Robotic Control

Pith reviewed 2026-06-26 05:09 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords in-context learning · vision-language-action models · robot control · system identification · generalization · adaptation without fine-tuning · world modeling

The pith

In-Context World Modeling lets robot policies infer system variables from short task-agnostic interaction histories to adapt to novel configurations without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Vision-Language-Action models fail on novel camera viewpoints or robot morphologies because they condition only on current observations and instructions, implicitly assuming fixed training contexts. ICWM reframes system identification as in-context adaptation, letting the policy process a brief history of its own task-agnostic interactions to capture current world dynamics before executing the task. This inference happens inside the existing context window and requires no fine-tuning or new parameters. A sympathetic reader would care because it removes the need for data-intensive retraining when deployment conditions change.

Core claim

ICWM treats system identification as an in-context adaptation problem. By processing a short history of self-generated, task-agnostic interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates.

What carries the argument

In-Context World Modeling (ICWM), which repurposes the model's context window to model system dynamics from interaction histories instead of using it only for task specification via demonstrations.
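
The contrast between the two uses of the context window can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not specify a token layout, so the `Transition` record, the role labels, and the choice to place the instruction after the interaction history are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (role, payload) pairs stand in for real model tokens; hypothetical layout.
Token = Tuple[str, str]


@dataclass(frozen=True)
class Transition:
    """One step of interaction: what was seen, what was done, what was seen next."""
    observation: str
    action: str
    next_observation: str


def build_icl_context(demos: List[Transition], instruction: str,
                      current_obs: str) -> List[Token]:
    """Traditional ICL: demonstrations of the target task specify WHAT to do."""
    context: List[Token] = [("instruction", instruction)]
    for t in demos:
        context.append(("demo_obs", t.observation))
        context.append(("demo_action", t.action))
    context.append(("obs", current_obs))
    return context


def build_icwm_context(probes: List[Transition], instruction: str,
                       current_obs: str) -> List[Token]:
    """ICWM-style: task-agnostic, self-generated interactions come first and
    show HOW the current system responds to actions (assumed ordering)."""
    context: List[Token] = []
    for t in probes:
        context.append(("probe_obs", t.observation))
        context.append(("probe_action", t.action))
        context.append(("probe_next_obs", t.next_observation))
    context.append(("instruction", instruction))
    context.append(("obs", current_obs))
    return context
```

In this sketch the probe transitions carry no information about the task itself; only the instruction does. The policy weights are untouched in both cases, which is what "without parameter updates" refers to.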

If this is right

  • Policies generalize to altered camera viewpoints without fine-tuning or additional data collection.
  • The same model can adapt to changes in robot morphologies using only the interaction history.
  • Outperformance over standard VLA baselines on novel camera viewpoints holds both in simulation and on real-world robot platforms.
  • System identification occurs autonomously from self-generated data before any task is attempted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could reduce the cost of collecting task-specific demonstration datasets by shifting adaptation work into the context window.
  • Similar context-based inference might apply to other control settings where the underlying plant parameters vary at deployment time.
  • If the interaction history can be made even shorter while remaining informative, the approach would become practical for very low-latency robot systems.

Load-bearing premise

A short history of task-agnostic interactions is sufficient to capture the full world dynamics needed for reliable adaptation across arbitrary novel configurations such as altered camera viewpoints or robot morphologies.
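
A toy example may help show why a handful of task-agnostic motions can be informative about a camera change at all. The sketch below is a classical, explicit estimator and is not the implicit in-context mechanism the paper describes. It assumes a deliberately simplified setting in which a viewpoint change acts as a pure planar rotation between commanded motions and observed image-plane motions; real viewpoint changes involve full 3D pose and projection. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]


def rotate(v: Vec2, theta: float) -> Vec2:
    """Rotate a 2D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def estimate_view_rotation(actions: List[Vec2], observed: List[Vec2]) -> float:
    """Least-squares angle that best maps commanded motions onto observed ones.

    For planar rotations the optimum has a closed form: the angle whose sine
    and cosine are proportional to the summed cross and dot products.
    """
    cross = sum(a[0] * o[1] - a[1] * o[0] for a, o in zip(actions, observed))
    dot = sum(a[0] * o[0] + a[1] * o[1] for a, o in zip(actions, observed))
    return math.atan2(cross, dot)


# Synthetic probe: three arbitrary motions seen through an unknown rotation.
probe_actions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
unknown_theta = 0.6  # synthetic ground truth, used only to generate the data
probe_observed = [rotate(a, unknown_theta) for a in probe_actions]
estimated_theta = estimate_view_rotation(probe_actions, probe_observed)
```

With noise-free synthetic data the estimate recovers the rotation used to generate it. Whether a learned policy can extract comparable information implicitly, and for configurations far richer than a single angle, is exactly what the premise above assumes.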

What would settle it

An experiment in which providing the short interaction history produces no measurable improvement in success rate on novel camera viewpoints compared with the baseline VLA model that receives no history.
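
The comparison described above reduces to two success rates measured on the same novel viewpoints. A minimal sketch of that check follows, using a standard pooled two-proportion z-statistic; the choice of statistic is an assumption made here, since the abstract does not say how significance was assessed, and the function names are hypothetical.

```python
import math
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes that succeeded."""
    return sum(outcomes) / len(outcomes)


def two_proportion_z(with_history: Sequence[bool],
                     without_history: Sequence[bool]) -> float:
    """Pooled two-proportion z-statistic for (rate with history) minus
    (rate without history). Values near zero mean no measurable difference."""
    n1, n2 = len(with_history), len(without_history)
    p1, p2 = success_rate(with_history), success_rate(without_history)
    pooled = (sum(with_history) + sum(without_history)) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Every episode had the same outcome in both arms, so the rates are equal.
        return 0.0
    return (p1 - p2) / se
```

Each input is a list of per-episode outcomes, one arm evaluated with the interaction history in context and one without. A statistic that stays near zero across novel viewpoints would be the negative result that undermines the central claim.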

Original abstract

Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume a fixed execution context encountered during training, necessitating data-intensive fine-tuning for any new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces In-Context World Modeling (ICWM) for Vision-Language-Action (VLA) models. ICWM treats system identification as an in-context adaptation task, allowing policies to infer essential system variables from a short history of self-generated, task-agnostic interactions. This enables adaptation to novel configurations (e.g., camera viewpoints or morphologies) without parameter updates, in contrast to standard VLA models that assume fixed training contexts. The abstract reports that extensive experiments in simulation and on real robots show ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.

Significance. If the empirical results are supported by proper metrics, baselines, and controls, the framework could reduce reliance on data-intensive fine-tuning for robotic policies in new environments by leveraging the context window for world dynamics inference. The approach builds on in-context learning ideas but applies them to system identification rather than task specification.

major comments (2)
  1. [Abstract] The central claim states that ICWM enables adaptation to 'novel configurations' including altered robot morphologies, yet the reported experiments are described only for novel camera viewpoints. This scope mismatch is load-bearing because the abstract supplies no evidence that a short history of task-agnostic interactions suffices to identify kinematic or morphological differences.
  2. [Abstract] The assertion of 'significant outperformance' on novel viewpoints supplies no metrics, baselines, controls, error bars, dataset details, or quantitative results, preventing evaluation of the soundness of the central empirical claim.
minor comments (1)
  1. The distinction drawn between traditional in-context learning (task specification via demonstrations) and ICWM (system dynamics inference) would benefit from a concrete illustrative example or pseudocode in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. We address each major comment below.

  1. Referee: [Abstract] The central claim states that ICWM enables adaptation to 'novel configurations' including altered robot morphologies, yet the reported experiments are described only for novel camera viewpoints. This scope mismatch is load-bearing because the abstract supplies no evidence that a short history of task-agnostic interactions suffices to identify kinematic or morphological differences.

    Authors: The referee is correct that the abstract introduces adaptation to novel configurations (including morphologies) in the opening sentences but then reports empirical results exclusively for novel camera viewpoints. The manuscript does not provide experiments or evidence demonstrating identification of kinematic or morphological differences via short task-agnostic histories. We will revise the abstract to remove the overbroad claim and align the stated contributions precisely with the evaluated scope.
    Revision: yes

  2. Referee: [Abstract] The assertion of 'significant outperformance' on novel viewpoints supplies no metrics, baselines, controls, error bars, dataset details, or quantitative results, preventing evaluation of the soundness of the central empirical claim.

    Authors: Abstracts are length-limited and conventionally summarize findings at a high level; the full quantitative results, including metrics, baselines, controls, error bars, and dataset details, appear in the experimental sections of the manuscript. We agree the abstract's phrasing is too vague on its own and will add a concise statement of the key performance gains (with reference to the detailed tables) to strengthen the summary.
    Revision: partial

Circularity Check

0 steps flagged

No circularity detected; no derivations or equations present

The provided manuscript text, including the abstract and full-text placeholder, contains no equations, mathematical derivations, parameter-fitting procedures, or load-bearing self-citations. The ICWM framework is introduced conceptually as treating system identification as in-context adaptation, with claims about inferring variables from interaction history, but without any formal chain that reduces predictions to inputs by construction. Each circularity pattern the audit checks for (self-definitional, fitted-input-as-prediction, etc.) requires an explicit reduction via equations or citations, and none appear here. The argument is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5724 in / 1000 out tokens · 20199 ms · 2026-06-26T05:09:02.685402+00:00 · methodology
