pith. machine review for the scientific record.

arxiv: 2605.06481 · v1 · submitted 2026-05-07 · 💻 cs.RO

Recognition: unknown

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 08:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords object-addressable world model · robot manipulation · vision-language-action · slot-based representation · scene perturbation robustness · address vector · world action model · flow-matching action head

The pith

Decomposing scenes into object slots with persistent address vectors lets world-action models keep object identities separate from their changing appearances, improving robustness to perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that representing the world as addressable object slots, each carrying a stable identity vector alongside time-varying content, gives an action model a reliable way to refer to specific objects even when the scene shifts. Existing holistic image or latent representations entangle identity with context, so instructions about one object become unreliable after rearrangements. By routing attention through address keys only and resetting the address slice each layer, the model separates which object to act on from what that object currently looks like. If this separation holds, the same forward pass can predict both future slot states and a chunk of continuous actions while staying accurate on geometric manipulation tasks that involve object swaps or displacements.

Core claim

OA-WAM decomposes each frame into N+1 slots (one robot slot plus N object slots), each holding a persistent address vector and a time-varying content vector. These slots are fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts the next-frame slot states while a flow-matching action head decodes a 16-step action chunk in the same pass. Addressability is enforced by using address-only keys for cross-slot attention and resetting the address slice at every transformer layer, which keeps object identity decoupled from current state without extra tokens.
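
To make the routing concrete, here is a minimal sketch of address-only keys with a per-layer address reset. Only the 32-dim address width is taken from the paper (Fig. 3); the content width, weights, layer count, and all names are illustrative assumptions, not the released implementation.

    # Hedged sketch: cross-slot attention whose keys see only the address slice,
    # with the address slice rewritten from the persistent address each layer.
    import torch
    import torch.nn.functional as F

    ADDR, CONTENT = 32, 96          # 32 follows Fig. 3; 96 is an assumed content width
    D = ADDR + CONTENT

    def slot_layer(slots, addr, Wq, Wk, Wv):
        q = slots @ Wq                                # queries see the full slot state
        k = addr @ Wk                                 # keys read the addresses only
        v = slots @ Wv                                # values carry the full state
        attn = F.softmax(q @ k.T / D ** 0.5, dim=-1)  # cross-slot routing by identity
        out = attn @ v
        # per-layer reset: identity never drifts with content updates
        return torch.cat([addr, out[:, ADDR:]], dim=-1)

    slots = torch.randn(5, D)                         # 1 robot slot + 4 object slots
    addr = slots[:, :ADDR].clone()                    # persistent address vectors
    Wq, Wv = torch.randn(D, D), torch.randn(D, D)
    Wk = torch.randn(ADDR, D)
    for _ in range(4):                                # a few stacked layers
        slots = slot_layer(slots, addr, Wq, Wk, Wv)

Note that the reset makes the address slice a fixed point by construction; what training must supply is that the learned addresses stay distinct enough for the keys to separate objects.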

What carries the argument

Object slots that each store a persistent address vector for identity and a separate content vector for appearance, with cross-slot attention routed exclusively through the address keys and the address slice reset per layer.

If this is right

  • On LIBERO and SimplerEnv benchmarks the model matches or exceeds strong VLA and WAM baselines, with particular gains on geometric axes that require precise object reference.
  • The same architecture produces a swap-binding cosine of 0.87, far higher than the 0.09 ceiling of holistic baselines, showing that addressable slots preserve identity under perturbation.
  • A single forward pass jointly predicts next world states and action chunks, so no separate planning stage is required; a sketch of the flow-matching decode follows this list.
  • The slot count N is fixed at training time, yet performance holds on scenes whose object counts and types stay within the training distribution.
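
The flow-matching decode named above admits a generic sketch: integrate a learned velocity field from Gaussian noise to a 16-step action chunk. The chunk length is the paper's; the action dimension, step count, and velocity_net interface are assumptions.

    # Generic Euler sampler for a flow-matching action head (not the paper's code).
    import torch

    CHUNK, ACT_DIM, STEPS = 16, 7, 10   # 16-step chunk per the paper; rest assumed

    @torch.no_grad()
    def decode_chunk(velocity_net, context):
        a = torch.randn(CHUNK, ACT_DIM)                      # a_0 ~ N(0, I)
        for i in range(STEPS):
            t = torch.tensor(i / STEPS)
            a = a + velocity_net(a, t, context) / STEPS      # Euler step toward t=1
        return a                                             # (16, ACT_DIM) actions

    # toy stand-in for the trained head, just to make the sketch executable;
    # in OA-WAM the context would be the backbone's [ACT-Q] hidden state
    dummy = lambda a, t, ctx: ctx - a
    chunk = decode_chunk(dummy, context=torch.zeros(CHUNK, ACT_DIM))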

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If address vectors generalize beyond training object counts, the same mechanism could support open-vocabulary instructions that mention previously unseen objects by description alone.
  • Resetting the address slice each layer may also reduce interference when multiple objects are referenced in one instruction, suggesting a path to multi-object sequential tasks.
  • The separation of identity from content could be tested in real-robot settings by physically rearranging objects mid-episode and checking whether the policy follows the original address or the new visual content.

Load-bearing premise

The learned address vectors remain stable and separable across time steps and scene interventions even without explicit binding supervision.

What would settle it

Run the causal slot-intervention test on a new set of scenes: swap two objects after the first frame and measure whether the model still binds actions to the original address vector rather than the swapped content; a drop below 0.5 cosine similarity would falsify the claim of stable addressability.
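
A hedged harness for that test might look as follows; model.encode, model.predict_action, and addr_dim are hypothetical stand-ins for whatever interface an implementation exposes, and the threshold is the 0.5 named above.

    # Sketch of the swap-binding probe: swap two objects' content slices while
    # holding their addresses fixed, then compare the decoded actions.
    import torch
    import torch.nn.functional as F

    def swap_binding_cosine(model, scene, obj_i, obj_j, target):
        slots = model.encode(scene)                       # (N+1, D) slot states
        ref = model.predict_action(slots, target=target)

        a = model.addr_dim                                # content starts after the address
        swapped = slots.clone()
        swapped[obj_i, a:], swapped[obj_j, a:] = slots[obj_j, a:], slots[obj_i, a:]
        alt = model.predict_action(swapped, target=target)

        # near the reported 0.87 supports stable addressability; below 0.5 falsifies it
        return F.cosine_similarity(ref.flatten(), alt.flatten(), dim=0)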

Figures

Figures reproduced from arXiv: 2605.06481 by Fang Chen, Lingfeng Zhang, Peibo Sun, Shiyuan Dong, Shoujie Li, Wenbo Ding, Xiao-Ping Zhang, Xintao Chao, Yifan Xie, Yushan Liu.

Figure 1: Overview of OA-WAM. Under scene perturbations (left, six typical axes), holistic WAMs entangle target identity with context in global tokens and drift to wrong actions (top right). Our OA-WAM (bottom right) decomposes each frame into N+1 addressable object slots whose cross-slot attention key reads only the identity address subvector, keeping manipulation robust. view at source ↗
Figure 2: OA-WAM architecture. Multi-modal inputs are encoded into separate token streams: object-slot tokens via SAM3+DINOv3, projected by a learnable slot adapter. Only slot tokens introduce learnable parameters; the others reuse frozen embed_tokens. Tokens are assembled into a block-causal sequence terminated by a learnable action query [ACT-Q] and processed by the slot-aware backbone. The world head reads slot hidde… view at source ↗
Figure 3: OA attention mask. Block-causal across frames; within-frame slots are bidirectional (red diagonal). W_K reads only addr_k (first 32 dims). The 7B trunk is a Chameleon-style multimodal autoregressive transformer (32 layers, hidden dimension 4096, 32 attention heads). At slot-typed positions, the standard self-attention is replaced by a slot-aware variant in which the key-projection input is restricted to the… view at source ↗
Figure 4: Main results. Left: LIBERO-Plus radar over the seven perturbation axes (Tab. 2); right: SimplerEnv WidowX (Bridge) per-task success (Tab. 1). OA-WAM sets a new SOTA on the geometric LIBERO-Plus axes (Geo-Avg 84.3, +4.8% over π0.5) and on SimplerEnv (79.3 avg). view at source ↗
Figure 5: Mechanism diagnostics (A1, A2). (a) LP-camera success vs. camera-shift angle ∆θ: V0 (full OA-WAM) and V1 (key mask off) overlap in-distribution and split as ∆θ grows. (b) Role-query attention from r1-4 (target/reference/tool/distractor) over slot types, averaged over 300 LIBERO-Spatial episodes. (c) End-effector trajectory under an A2 address swap: OA-WAM deflects toward the swapped target, the holistic ba… view at source ↗
Figure 6 · view at source ↗
Figure 7: Representative tasks from the four LIBERO suites. view at source ↗
Figure 8: Representative tasks from the four SimplerEnv WidowX (Bridge) suites. view at source ↗
Figure 9: LIBERO-Plus perturbation gallery. We visualize the seven perturbation axes of LIBERO-Plus (columns: Objects Layout, Background Textures, Light Conditions, Camera Viewpoints, Robot Initial States, Language Instructions, Sensor Noise) applied to four LIBERO suites (rows: Spatial, Object, Goal, Long-horizon). Each row anchors on a single base scene so that the seven panels in that row share identical objects… view at source ↗
Original abstract

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes OA-WAM, an Object-Addressable World Action Model that decomposes each frame into N+1 slots (one robot slot plus N object slots). Each slot consists of a persistent address vector and a time-varying content vector; these are fused with text, image, proprioception, and action tokens in a block-causal transformer. Cross-slot attention is routed exclusively through address keys with address slices reset per layer. A world head predicts next-frame slot states while a flow-matching head decodes 16-step action chunks. The model reports 97.8% success on LIBERO, 79.3% on SimplerEnv, state-of-the-art results on selected LIBERO-Plus geometric axes, and a swap-binding cosine of 0.87 (versus at most 0.09 for holistic baselines) in a causal slot-intervention test.

Significance. If the address vectors prove stable and separable without explicit binding supervision and generalize beyond the training distribution, the approach would supply a concrete, addressable interface for object-specific action decoding inside world-action models. This could improve robustness to scene perturbations compared with holistic image or latent representations. The causal slot-intervention test and the swap-binding cosine metric constitute a useful, falsifiable evaluation protocol for addressability that future work can build upon.

major comments (3)
  1. [§3.1] Architecture description: The model fixes N object slots as a training hyperparameter and resets the address slice at every transformer layer while routing attention only through address keys. No analysis or experiment demonstrates that this separation survives when the number of objects in a scene differs from the training distribution; slot merging or splitting would directly undermine the claimed address-content decoupling.
  2. [§4.3] Causal slot-intervention test and results tables: The reported swap-binding cosine of 0.87 is obtained inside the training distribution of object counts and types. The test therefore does not probe whether address vectors remain stable and separable under the very scene perturbations (variable object cardinality, novel object types) that the abstract claims the model handles robustly.
  3. [Results] Results section and abstract: Performance figures (97.8% LIBERO, 79.3% SimplerEnv) are given without error bars, standard deviations, or ablations on slot count N and training losses. This absence makes it impossible to determine whether the gains on LIBERO-Plus geometric axes are statistically reliable or attributable to the address mechanism rather than other modeling choices.
minor comments (2)
  1. [Abstract] The phrase 'most relevant LIBERO-Plus geometric axes' is not defined; the manuscript should list the specific axes and the exact scores achieved on them.
  2. [§3] Notation: The distinction between 'persistent address vector' and 'time-varying content vector' is introduced in the abstract and §3 but would benefit from an explicit equation or diagram showing how the two vectors are concatenated or separated inside each slot state.
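
One candidate formalization of the requested split, in this review's notation rather than the paper's; only the 32-dim address width is taken from the figures.

    % slot i at frame t, layer l: persistent address concatenated with content
    s_i^{(t,\ell)} = [\, a_i \,\Vert\, c_i^{(t,\ell)} \,] \in \mathbb{R}^{d}, \qquad
    a_i \in \mathbb{R}^{32}, \quad c_i^{(t,\ell)} \in \mathbb{R}^{d-32}

    % address-only keys; the address slice of each layer's output is overwritten
    k_i = W_K\, a_i, \qquad
    s_i^{(t,\ell+1)} = [\, a_i \,\Vert\, \mathrm{Attn}(W_Q s^{(t,\ell)},\, k,\, W_V s^{(t,\ell)})_i{}_{[32:]} \,]

Read: the key projection sees only a_i, so attention routes by identity, while the reset pins the first 32 dimensions of every slot to its persistent address.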

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications on the design and evaluation of OA-WAM while indicating the revisions we will incorporate to strengthen the paper.

Point-by-point responses
  1. Referee: [§3.1] Architecture description: The model fixes N object slots as a training hyperparameter and resets the address slice at every transformer layer while routing attention only through address keys. No analysis or experiment demonstrates that this separation survives when the number of objects in a scene differs from the training distribution; slot merging or splitting would directly undermine the claimed address-content decoupling.

    Authors: We appreciate the referee pointing out the implications of fixing N as a hyperparameter. In OA-WAM, N is selected to be larger than the maximum object count in the training data, with inactive slots assigned distinct address vectors but near-zero content vectors that do not participate meaningfully in attention or prediction. The address-only key routing and per-layer reset are intended to maintain separation regardless of which slots are active. While the reported benchmarks include scene variations that implicitly affect object presence, we did not explicitly evaluate on scenes with object cardinalities far outside the training range. We will add a targeted experiment in the revision (new subsection in §4) testing variable object counts via slot masking/padding (a sketch of this padding scheme follows these responses) and measuring the resulting swap-binding cosine and task performance to directly validate the decoupling under such shifts. revision: yes

  2. Referee: [§4.3] Causal slot-intervention test and results tables: The reported swap-binding cosine of 0.87 is obtained inside the training distribution of object counts and types. The test therefore does not probe whether address vectors remain stable and separable under the very scene perturbations (variable object cardinality, novel object types) that the abstract claims the model handles robustly.

    Authors: The causal slot-intervention test and swap-binding cosine metric are designed to provide a controlled, falsifiable probe of address-content decoupling by measuring whether address vectors can be causally swapped while preserving object-specific predictions. This evaluation is intentionally performed within the training distribution to isolate the binding property without confounding factors from distribution shift. Robustness to scene perturbations (including geometric changes and novel configurations) is instead demonstrated via the end-to-end results on LIBERO-Plus and SimplerEnv. We will revise §4.3 and the abstract to explicitly clarify the distinct roles of the intervention test versus the benchmark evaluations, and add a limitations paragraph noting the current scope of the test while emphasizing that address stability under broader perturbations remains an important direction for future work. revision: partial

  3. Referee: [Results] Results section and abstract: Performance figures (97.8% LIBERO, 79.3% SimplerEnv) are given without error bars, standard deviations, or ablations on slot count N and training losses. This absence makes it impossible to determine whether the gains on LIBERO-Plus geometric axes are statistically reliable or attributable to the address mechanism rather than other modeling choices.

    Authors: We agree that the absence of error bars, standard deviations, and targeted ablations limits the ability to assess statistical reliability and isolate the contribution of the address mechanism. In the revised manuscript we will report all main results with standard deviations over multiple random seeds (minimum three runs) and include error bars in the tables and figures. We will also add an ablation study on slot count N (testing values both below and above the chosen hyperparameter) and on the relative weighting of the world-head prediction loss versus the action head, with results and analysis placed in the main text or supplementary material as appropriate. These changes will allow readers to better attribute performance gains to the object-addressable design. revision: yes
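
The padding scheme promised in response 1 could look like the sketch below; N, the dimensions, and the learned address bank are assumptions for illustration, not the authors' implementation.

    # Hedged sketch: fixed N slots; scenes with fewer objects are padded with
    # distinct addresses and near-zero content, per the rebuttal's description.
    import torch

    N, ADDR, CONTENT = 8, 32, 96

    def pad_slots(object_slots, addr_bank):
        """object_slots: (m, ADDR+CONTENT) for the m detected objects, m <= N."""
        m = object_slots.shape[0]
        assert m <= N, "more objects than slots; the fixed-N design breaks here"
        pad = torch.cat([addr_bank[m:N],                  # distinct identities
                         torch.zeros(N - m, CONTENT)],    # near-zero content
                        dim=-1)
        return torch.cat([object_slots, pad], dim=0)      # always (N, ADDR+CONTENT)

    addr_bank = torch.randn(N, ADDR)                      # learned in practice
    padded = pad_slots(torch.randn(3, ADDR + CONTENT), addr_bank)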

Circularity Check

0 steps flagged

No significant circularity in architectural proposal or empirical claims

full rationale

The paper proposes OA-WAM as an architectural extension to world action models, defining slot states with persistent address vectors and content vectors, then enforcing separation via address-only attention keys and per-layer address resets. It reports empirical results on external benchmarks (LIBERO, SimplerEnv) and a newly introduced causal slot-intervention test with a swap-binding cosine metric. No derivation chain reduces any claimed result to its inputs by construction: the performance numbers and cosine value are measured outcomes, not algebraic identities or refitted parameters renamed as predictions. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and the metric is presented as an independent diagnostic rather than tautological. The central claim therefore rests on observable benchmark behavior and comparative testing rather than logical self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The approach assumes that object identity can be factored into a persistent address and time-varying content without additional losses or supervision; this factorization is introduced by the paper rather than derived from prior results.

free parameters (1)
  • N (number of object slots)
    Chosen to match expected scene complexity; value not stated in abstract but required for the decomposition.
axioms (1)
  • domain assumption · Object identity remains factorizable into address and content across scene perturbations
    Invoked to justify the slot design and intervention test.
invented entities (2)
  • Persistent address vector · no independent evidence
    purpose: To identify which object to act on independently of its current visual state
    New representational primitive introduced to solve entanglement problem
  • Time-varying content vector · no independent evidence
    purpose: To capture current state of each addressed object
    Paired with address vector to separate identity from appearance

pith-pipeline@v0.9.0 · 5629 in / 1479 out tokens · 38451 ms · 2026-05-08T08:47:56.032206+00:00 · methodology


Reference graph

Works this paper leans on

102 extracted references · 100 canonical work pages · 48 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Focusing on what matters: Object-Agent-centric Tokenization for Vision Language Action models

    Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, and Pietro Mazzaglia. Focusing on what matters: Object-Agent-centric Tokenization for Vision Language Action models.arXiv preprint arXiv:2509.23655, 2025

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025. arXiv:2410.24164

  4. [4]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amael Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024

  5. [5]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  6. [6]

    MONet: Unsupervised Scene Decomposition and Representation

    Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. MONet: Unsupervised scene decomposition and representation.arXiv preprint arXiv:1901.11390, 2019

  7. [7]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment Anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021. arXiv:2104.14294

  9. [9]

    RynnVLA-002: A Unified Vision-Language-Action and World Model

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. RynnVLA-002: A unified Vision-Language-Action and world model.arXiv preprint arXiv:2511.17502, 2025

  10. [10]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  11. [11]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  12. [12]

    STORM: Slot-based Task-aware Object-centric Representation for Robotic Manipulation

    Alexandre Chapin, Emmanuel Dellandréa, and Liming Chen. STORM: Slot-based task-aware object-centric representation for robotic manipulation. arXiv preprint arXiv:2601.20381, 2026

  13. [13]

    Goal-VLA: Image-generative VLMs as Object-centric World Models Empowering Zero-shot Robot Manipulation

    Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, and Lin Shao. Goal-VLA: Image-generative VLMs as object-centric world models empowering zero-shot robot manipulation.arXiv preprint arXiv:2506.23919, 2025

  14. [14]

    InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, et al. InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy.arXiv preprint arXiv:2510.13778, 2025

  15. [15]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023. arXiv:2303.04137

  16. [16]

    Learning Universal Policies via Text-Guided Video Generation

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2302.00111

  17. [17]

    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos

    Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, and Thomas Kipf. SAVi++: Towards end-to-end object-centric learning from real-world videos. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2206.07764

  18. [18]

    GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations

    Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, and Ingmar Posner. GENESIS: Generative scene inference and sampling with object-centric latent representations. In International Conference on Learning Representations (ICLR), 2020. arXiv:1907.13052

  19. [19]

    LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth robustness analysis of Vision-Language-Action models. arXiv preprint arXiv:2510.13626, 2025

  20. [20]

    FOCUS: Object-Centric World Models for Robotics Manipulation

    Stefano Ferraro, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. FOCUS: Object-centric world models for robotics manipulation.arXiv preprint arXiv:2307.02427, 2023

  21. [21]

    NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

    Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, and George Konidaris. NovaPlan: Zero-shot long-horizon manipulation via closed-loop video language planning. arXiv preprint arXiv:2602.20119, 2026

  22. [22]

    Multi-Object Representation Learning with Iterative Variational Inference

    Klaus Greff, Raphael Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. InInternational Conference on Machine Learning (ICML), 2019. arXiv:1903.00450

  23. [23]

    On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

    Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, et al. On robustness of Vision-Language-Action model against multi-modal perturbations.arXiv preprint arXiv:2510.00037, 2025

  24. [24]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.Nature, 2025. arXiv:2301.04104

  25. [25]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2310.16828

  26. [26]

    SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

    Taisei Hanyu, Nhat Chung, Huy Le, Toan Nguyen, Yuki Ikebe, Anthony Gunderman, Duy Ho Minh Nguyen, Khoa Vo, Tung Kieu, Kashu Yamazaki, et al. SlotVLA: Towards modeling of object-relation representations in robotic manipulation. arXiv preprint arXiv:2511.06754, 2025

  27. [27]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. ThinkAct: Vision-language-action reasoning via reinforced visual latent planning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2507.16815

  28. [28]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. NORA: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854, 2025

  29. [29]

    Object-Centric World Model for Language-Guided Manipulation

    Youngjoon Jeong, Junha Chun, Soonwoo Cha, and Taesup Kim. Object-centric world model for language-guided manipulation. arXiv preprint arXiv:2503.06170, 2025

  30. [30]

    VIMA: General Robot Manipulation with Multimodal Prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: General robot manipulation with multimodal prompts. InInternational Conference on Machine Learning (ICML), 2023. arXiv:2210.03094

  31. [31]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems (RSS), 2024. arXiv:2403.12945

  32. [32]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source Vision-Language-Action model. arXiv preprint arXiv:2406.09246, 2024

  33. [33]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning Vision-Language-Action models: Optimizing speed and success. InRobotics: Science and Systems (RSS), 2025. arXiv:2502.19645

  34. [34]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, et al. Cosmos Policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  35. [35]

    Conditional Object-Centric Learning from Video

    Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, and Klaus Greff. Conditional object-centric learning from video. InInternational Conference on Learning Representations (ICLR), 2022. arXiv:2111.12594

  36. [36]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023. arXiv:2304.02643

  37. [37]

    What Matters When Building Vision-Language Models?

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024

  38. [38]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  39. [39]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  40. [40]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. InConference on Robot Learning (CoRL), 2024. arXiv:2405.05941

  41. [41]

    ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

    Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, and Shanghang Zhang. ManipDreamer: Boosting robotic manipulation world model with action tree and visual guidance.arXiv preprint arXiv:2504.16464, 2025

  42. [42]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, et al. Genie Envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

  43. [43]

    HoloBrain-0 Technical Report

    Xuewu Lin, Tianwei Lin, Yun Du, Hongyu Xie, Yiwei Jin, Jiawei Li, Shijie Wu, Qingze Wang, Mengdi Li, Mengao Zhao, Ziang Li, Chaodong Huang, Hongzhe Bi, Lichao Huang, and Zhizhong Su. HoloBrain-0 technical report.arXiv preprint arXiv:2602.12062, 2026

  44. [44]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. arXiv:2210.02747

  45. [45]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2306.03310

  46. [46]

    Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

    Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024

  47. [47]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2304.08485

  48. [48]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  49. [49]

    World action verifier: Self-improving world models via forward-inverse asymmetry

    Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. arXiv preprint arXiv:2604.01985, 2026

  50. [50]

    Object-centric learning with slot attention

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.15055

  51. [51]

    Being-H0.7: A Latent World-Action Model from Egocentric Videos

    Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu, Chaoyi Xu, Ziheng Xi, Yuhui Fu, and Zongqing Lu. Being-H0.7: A latent world-action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

  52. [52]

    F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, et al. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025

  53. [53]

    Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

    Xiangkai Ma, Lekai Xing, Han Zhang, Wenzhong Li, and Sanglu Lu. Unifying perception and action: A hybrid-modality pipeline with implicit visual chain-of-thought for robotic action generation.arXiv preprint arXiv:2511.19859, 2025

  54. [54]

    SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels

    Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, and Sven Behnke. SOLD: Slot object-centric latent dynamics models for relational manipulation learning from pixels.arXiv preprint arXiv:2410.08822, 2024

  55. [55]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

  56. [56]

    GR00T N1.5: An improved open foundation model for generalist humanoid robots

    NVIDIA GEAR Team. GR00T N1.5: An improved open foundation model for generalist humanoid robots. NVIDIA Research Blog, June 2025. https://research.nvidia.com/labs/gear/gr00t-n1_5/

  57. [57]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. InRobotics: Science and Systems (RSS), 2024. arXiv:2405.12213

  58. [58]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023

  59. [59]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. arXiv:2304.07193

  60. [60]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a Vision-Language-Action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  61. [61]

    THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, and Dieter Fox. THE COLOSSEUM: A benchmark for evaluating generalization for robotic manipulation.arXiv preprint arXiv:2402.08191, 2024

  62. [62]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, Jiayuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. SpatialVLA: Exploring spatial representations for Visual-Language-Action model. arXiv preprint arXiv:2501.15830, 2025

  63. [63]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021. arXiv:2103.00020

  64. [64]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. arXiv:1910.10683

  65. [65]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment Anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  66. [66]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, et al. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

  67. [67]

    CLIPort: What and Where Pathways for Robotic Manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. InConference on Robot Learning (CoRL), 2021. arXiv:2109.12098

  68. [68]

    Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning (CoRL), 2022. arXiv:2209.05451

  69. [69]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. SmolVLA: A Vision-Language-Action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  70. [70]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  71. [71]

    Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos

    Gautam Singh, Yi-Fu Wu, and Sungjin Ahn. Simple unsupervised object-centric learning for complex and naturalistic videos. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2205.14065

  72. [72]

    World Guidance: World Modeling in Condition Space for Action Generation

    Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World Guidance: World modeling in condition space for action generation.arXiv preprint arXiv:2602.22010, 2026

  73. [73]

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, et al. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  74. [74]

    Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding

    Khoa Vo, Taisei Hanyu, Yuki Ikebe, Trong Thang Pham, Nhat Chung, Minh Nhat Vu, Duy Ho Minh Nguyen, Anh Nguyen, Anthony Gunderman, Chase Rainwater, and Ngan Le. Clutter-robust Vision-Language-Action models through object-centric and geometry grounding. arXiv preprint arXiv:2512.22519, 2025

  75. [75]

    BridgeData V2: A Dataset for Robot Learning at Scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. InConference on Robot Learning (CoRL), 2023. arXiv:2308.12952

  76. [76]

    LIBERO-X: Robustness litmus for vision-language-action models

    Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai, Junjie Liu, and Xinmin Liu. LIBERO-X: Robustness litmus for Vision-Language-Action models.arXiv preprint arXiv:2602.06556, 2026

  77. [77]

    VGGT: Visual Geometry Grounded Transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.11651

  78. [78]

    OmniTokenizer: A joint image-video tokenizer for visual generation

    Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. OmniTokenizer: A joint image-video tokenizer for visual generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.09399

  79. [79]

    Unified Vision-Language-Action Model

    Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified Vision-Language-Action model.arXiv preprint arXiv:2506.19850, 2025

  80. [80]

    FoundationPose: Unified 6D pose estimation and tracking of novel objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2312.08344

Showing first 80 references.