pith. machine review for the scientific record.

arxiv: 2605.06481 · v1 · submitted 2026-05-07 · 💻 cs.RO

Recognition: unknown

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 08:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords object-addressable world model · robot manipulation · vision-language-action · slot-based representation · scene perturbation robustness · address vector · world action model · flow-matching action head

The pith

Decomposing scenes into object slots with persistent address vectors lets world-action models keep object identities separate from their changing appearances, improving robustness to perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that representing the world as addressable object slots, each carrying a stable identity vector alongside time-varying content, gives an action model a reliable way to refer to specific objects even when the scene shifts. Existing holistic image or latent representations entangle identity with context, so instructions about one object become unreliable after rearrangements. By routing attention through address keys only and resetting the address slice each layer, the model separates which object to act on from what that object currently looks like. If this separation holds, the same forward pass can predict both future slot states and a chunk of continuous actions while staying accurate on geometric manipulation tasks that involve object swaps or displacements.

Core claim

OA-WAM decomposes each frame into N+1 slots (one robot slot plus N object slots), each holding a persistent address vector and a time-varying content vector. These slots are fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts the next-frame slot states while a flow-matching action head decodes a 16-step action chunk in the same pass. Addressability is enforced by using address-only keys for cross-slot attention and resetting the address slice at every transformer layer, which keeps object identity decoupled from current state without extra tokens.
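
To make the routing concrete, here is a minimal sketch of address-only keys with a per-layer address reset. Only the 32-dim address width is taken from the paper (Fig. 3); the content width, weights, layer count, and all names are illustrative assumptions, not the released implementation.

    # Hedged sketch: cross-slot attention whose keys see only the address slice,
    # with the address slice rewritten from the persistent address each layer.
    import torch
    import torch.nn.functional as F

    ADDR, CONTENT = 32, 96          # 32 follows Fig. 3; 96 is an assumed content width
    D = ADDR + CONTENT

    def slot_layer(slots, addr, Wq, Wk, Wv):
        q = slots @ Wq                                # queries see the full slot state
        k = addr @ Wk                                 # keys read the addresses only
        v = slots @ Wv                                # values carry the full state
        attn = F.softmax(q @ k.T / D ** 0.5, dim=-1)  # cross-slot routing by identity
        out = attn @ v
        # per-layer reset: identity never drifts with content updates
        return torch.cat([addr, out[:, ADDR:]], dim=-1)

    slots = torch.randn(5, D)                         # 1 robot slot + 4 object slots
    addr = slots[:, :ADDR].clone()                    # persistent address vectors
    Wq, Wv = torch.randn(D, D), torch.randn(D, D)
    Wk = torch.randn(ADDR, D)
    for _ in range(4):                                # a few stacked layers
        slots = slot_layer(slots, addr, Wq, Wk, Wv)

Note that the reset makes the address slice a fixed point by construction; what training must supply is that the learned addresses stay distinct enough for the keys to separate objects.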

What carries the argument

Object slots that each store a persistent address vector for identity and a separate content vector for appearance, with cross-slot attention routed exclusively through the address keys and the address slice reset per layer.

If this is right

  • On LIBERO and SimplerEnv benchmarks the model matches or exceeds strong VLA and WAM baselines, with particular gains on geometric axes that require precise object reference.
  • The same architecture produces a swap-binding cosine of 0.87, far higher than the 0.09 ceiling of holistic baselines, showing that addressable slots preserve identity under perturbation.
  • A single forward pass jointly predicts next world states and action chunks, so no separate planning stage is required; a sketch of the flow-matching decode follows this list.
  • The slot count N is fixed at training time, yet performance holds on scenes whose object counts and types stay within the training distribution.
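
The flow-matching decode named above admits a generic sketch: integrate a learned velocity field from Gaussian noise to a 16-step action chunk. The chunk length is the paper's; the action dimension, step count, and velocity_net interface are assumptions.

    # Generic Euler sampler for a flow-matching action head (not the paper's code).
    import torch

    CHUNK, ACT_DIM, STEPS = 16, 7, 10   # 16-step chunk per the paper; rest assumed

    @torch.no_grad()
    def decode_chunk(velocity_net, context):
        a = torch.randn(CHUNK, ACT_DIM)                      # a_0 ~ N(0, I)
        for i in range(STEPS):
            t = torch.tensor(i / STEPS)
            a = a + velocity_net(a, t, context) / STEPS      # Euler step toward t=1
        return a                                             # (16, ACT_DIM) actions

    # toy stand-in for the trained head, just to make the sketch executable;
    # in OA-WAM the context would be the backbone's [ACT-Q] hidden state
    dummy = lambda a, t, ctx: ctx - a
    chunk = decode_chunk(dummy, context=torch.zeros(CHUNK, ACT_DIM))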

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If address vectors generalize beyond training object counts, the same mechanism could support open-vocabulary instructions that mention previously unseen objects by description alone.
  • Resetting the address slice each layer may also reduce interference when multiple objects are referenced in one instruction, suggesting a path to multi-object sequential tasks.
  • The separation of identity from content could be tested in real-robot settings by physically rearranging objects mid-episode and checking whether the policy follows the original address or the new visual content.

Load-bearing premise

The learned address vectors remain stable and separable across time steps and scene interventions even without explicit binding supervision.

What would settle it

Run the causal slot-intervention test on a new set of scenes: swap two objects after the first frame and measure whether the model still binds actions to the original address vector rather than the swapped content; a drop below 0.5 cosine similarity would falsify the claim of stable addressability.
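
A hedged harness for that test might look as follows; model.encode, model.predict_action, and addr_dim are hypothetical stand-ins for whatever interface an implementation exposes, and the threshold is the 0.5 named above.

    # Sketch of the swap-binding probe: swap two objects' content slices while
    # holding their addresses fixed, then compare the decoded actions.
    import torch
    import torch.nn.functional as F

    def swap_binding_cosine(model, scene, obj_i, obj_j, target):
        slots = model.encode(scene)                       # (N+1, D) slot states
        ref = model.predict_action(slots, target=target)

        a = model.addr_dim                                # content starts after the address
        swapped = slots.clone()
        swapped[obj_i, a:], swapped[obj_j, a:] = slots[obj_j, a:], slots[obj_i, a:]
        alt = model.predict_action(swapped, target=target)

        # near the reported 0.87 supports stable addressability; below 0.5 falsifies it
        return F.cosine_similarity(ref.flatten(), alt.flatten(), dim=0)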

Figures

Figures reproduced from arXiv: 2605.06481 by Fang Chen, Lingfeng Zhang, Peibo Sun, Shiyuan Dong, Shoujie Li, Wenbo Ding, Xiao-Ping Zhang, Xintao Chao, Yifan Xie, Yushan Liu.

Figure 1: Overview of OA-WAM. Under scene perturbations (left, six typical axes), holistic WAMs entangle target identity with context in global tokens and drift to wrong actions (top right). Our OA-WAM (bottom right) decomposes each frame into N+1 addressable object slots whose cross-slot attention key reads only the identity address subvector, keeping manipulation robust. view at source ↗
Figure 2: OA-WAM architecture. Multi-modal inputs are encoded into separate token streams: object-slot tokens via SAM3+DINOv3, projected by a learnable slot adapter. Only slot tokens introduce learnable parameters; the others reuse frozen embed_tokens. Tokens are assembled into a block-causal sequence terminated by a learnable action query [ACT-Q] and processed by the slot-aware backbone. The world head reads slot hidde… view at source ↗
Figure 3: OA attention mask. Block-causal across frames; within-frame slots are bidirectional (red diagonal). W_K reads only addr_k (first 32 dims). The 7B trunk is a Chameleon-style multimodal autoregressive transformer (32 layers, hidden dimension 4096, 32 attention heads). At slot-typed positions, the standard self-attention is replaced by a slot-aware variant in which the key-projection input is restricted to the… view at source ↗
Figure 4: Main results. Left: LIBERO-Plus radar over the seven perturbation axes (Tab. 2); right: SimplerEnv WidowX (Bridge) per-task success (Tab. 1). OA-WAM sets a new SOTA on the geometric LIBERO-Plus axes (Geo-Avg 84.3, +4.8% over π0.5) and on SimplerEnv (79.3 avg). view at source ↗
Figure 5: Mechanism diagnostics (A1, A2). (a) LP-camera success vs. camera-shift angle ∆θ: V0 (full OA-WAM) and V1 (key mask off) overlap in-distribution and split as ∆θ grows. (b) Role-query attention from r1-4 (target/reference/tool/distractor) over slot types, averaged over 300 LIBERO-Spatial episodes. (c) End-effector trajectory under an A2 address swap: OA-WAM deflects toward the swapped target, the holistic ba… view at source ↗
Figure 6 · view at source ↗
Figure 7: Representative tasks from the four LIBERO suites. view at source ↗
Figure 8: Representative tasks from the four SimplerEnv WidowX (Bridge) suites. view at source ↗
Figure 9: LIBERO-Plus perturbation gallery. We visualize the seven perturbation axes of LIBERO-Plus (columns: Objects Layout, Background Textures, Light Conditions, Camera Viewpoints, Robot Initial States, Language Instructions, Sensor Noise) applied to four LIBERO suites (rows: Spatial, Object, Goal, Long-horizon). Each row anchors on a single base scene so that the seven panels in that row share identical objects… view at source ↗
Original abstract

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes OA-WAM, an Object-Addressable World Action Model that decomposes each frame into N+1 slots (one robot slot plus N object slots). Each slot consists of a persistent address vector and a time-varying content vector; these are fused with text, image, proprioception, and action tokens in a block-causal transformer. Cross-slot attention is routed exclusively through address keys with address slices reset per layer. A world head predicts next-frame slot states while a flow-matching head decodes 16-step action chunks. The model reports 97.8% success on LIBERO, 79.3% on SimplerEnv, state-of-the-art results on selected LIBERO-Plus geometric axes, and a swap-binding cosine of 0.87 (versus at most 0.09 for holistic baselines) in a causal slot-intervention test.

Significance. If the address vectors prove stable and separable without explicit binding supervision and generalize beyond the training distribution, the approach would supply a concrete, addressable interface for object-specific action decoding inside world-action models. This could improve robustness to scene perturbations compared with holistic image or latent representations. The causal slot-intervention test and the swap-binding cosine metric constitute a useful, falsifiable evaluation protocol for addressability that future work can build upon.

major comments (3)
  1. [§3.1] Architecture description: The model fixes N object slots as a training hyperparameter and resets the address slice at every transformer layer while routing attention only through address keys. No analysis or experiment demonstrates that this separation survives when the number of objects in a scene differs from the training distribution; slot merging or splitting would directly undermine the claimed address-content decoupling.
  2. [§4.3] Causal slot-intervention test and results tables: The reported swap-binding cosine of 0.87 is obtained inside the training distribution of object counts and types. The test therefore does not probe whether address vectors remain stable and separable under the very scene perturbations (variable object cardinality, novel object types) that the abstract claims the model handles robustly.
  3. [Results] Results section and abstract: Performance figures (97.8% LIBERO, 79.3% SimplerEnv) are given without error bars, standard deviations, or ablations on slot count N and training losses. This absence makes it impossible to determine whether the gains on LIBERO-Plus geometric axes are statistically reliable or attributable to the address mechanism rather than other modeling choices.
minor comments (2)
  1. [Abstract] The phrase 'most relevant LIBERO-Plus geometric axes' is not defined; the manuscript should list the specific axes and the exact scores achieved on them.
  2. [§3] Notation: The distinction between 'persistent address vector' and 'time-varying content vector' is introduced in the abstract and §3 but would benefit from an explicit equation or diagram showing how the two vectors are concatenated or separated inside each slot state.
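
One candidate formalization of the requested split, in this review's notation rather than the paper's; only the 32-dim address width is taken from the figures.

    % slot i at frame t, layer l: persistent address concatenated with content
    s_i^{(t,\ell)} = [\, a_i \,\Vert\, c_i^{(t,\ell)} \,] \in \mathbb{R}^{d}, \qquad
    a_i \in \mathbb{R}^{32}, \quad c_i^{(t,\ell)} \in \mathbb{R}^{d-32}

    % address-only keys; the address slice of each layer's output is overwritten
    k_i = W_K\, a_i, \qquad
    s_i^{(t,\ell+1)} = [\, a_i \,\Vert\, \mathrm{Attn}(W_Q s^{(t,\ell)},\, k,\, W_V s^{(t,\ell)})_i{}_{[32:]} \,]

Read: the key projection sees only a_i, so attention routes by identity, while the reset pins the first 32 dimensions of every slot to its persistent address.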

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications on the design and evaluation of OA-WAM while indicating the revisions we will incorporate to strengthen the paper.

Point-by-point responses
  1. Referee: [§3.1] Architecture description: The model fixes N object slots as a training hyperparameter and resets the address slice at every transformer layer while routing attention only through address keys. No analysis or experiment demonstrates that this separation survives when the number of objects in a scene differs from the training distribution; slot merging or splitting would directly undermine the claimed address-content decoupling.

    Authors: We appreciate the referee pointing out the implications of fixing N as a hyperparameter. In OA-WAM, N is selected to be larger than the maximum object count in the training data, with inactive slots assigned distinct address vectors but near-zero content vectors that do not participate meaningfully in attention or prediction. The address-only key routing and per-layer reset are intended to maintain separation regardless of which slots are active. While the reported benchmarks include scene variations that implicitly affect object presence, we did not explicitly evaluate on scenes with object cardinalities far outside the training range. We will add a targeted experiment in the revision (new subsection in §4) testing variable object counts via slot masking/padding (a sketch of this padding scheme follows these responses) and measuring the resulting swap-binding cosine and task performance to directly validate the decoupling under such shifts. revision: yes

  2. Referee: [§4.3] Causal slot-intervention test and results tables: The reported swap-binding cosine of 0.87 is obtained inside the training distribution of object counts and types. The test therefore does not probe whether address vectors remain stable and separable under the very scene perturbations (variable object cardinality, novel object types) that the abstract claims the model handles robustly.

    Authors: The causal slot-intervention test and swap-binding cosine metric are designed to provide a controlled, falsifiable probe of address-content decoupling by measuring whether address vectors can be causally swapped while preserving object-specific predictions. This evaluation is intentionally performed within the training distribution to isolate the binding property without confounding factors from distribution shift. Robustness to scene perturbations (including geometric changes and novel configurations) is instead demonstrated via the end-to-end results on LIBERO-Plus and SimplerEnv. We will revise §4.3 and the abstract to explicitly clarify the distinct roles of the intervention test versus the benchmark evaluations, and add a limitations paragraph noting the current scope of the test while emphasizing that address stability under broader perturbations remains an important direction for future work. revision: partial

  3. Referee: [Results] Results section and abstract: Performance figures (97.8% LIBERO, 79.3% SimplerEnv) are given without error bars, standard deviations, or ablations on slot count N and training losses. This absence makes it impossible to determine whether the gains on LIBERO-Plus geometric axes are statistically reliable or attributable to the address mechanism rather than other modeling choices.

    Authors: We agree that the absence of error bars, standard deviations, and targeted ablations limits the ability to assess statistical reliability and isolate the contribution of the address mechanism. In the revised manuscript we will report all main results with standard deviations over multiple random seeds (minimum three runs) and include error bars in the tables and figures. We will also add an ablation study on slot count N (testing values both below and above the chosen hyperparameter) and on the relative weighting of the world-head prediction loss versus the action head, with results and analysis placed in the main text or supplementary material as appropriate. These changes will allow readers to better attribute performance gains to the object-addressable design. revision: yes
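
The padding scheme promised in response 1 could look like the sketch below; N, the dimensions, and the learned address bank are assumptions for illustration, not the authors' implementation.

    # Hedged sketch: fixed N slots; scenes with fewer objects are padded with
    # distinct addresses and near-zero content, per the rebuttal's description.
    import torch

    N, ADDR, CONTENT = 8, 32, 96

    def pad_slots(object_slots, addr_bank):
        """object_slots: (m, ADDR+CONTENT) for the m detected objects, m <= N."""
        m = object_slots.shape[0]
        assert m <= N, "more objects than slots; the fixed-N design breaks here"
        pad = torch.cat([addr_bank[m:N],                  # distinct identities
                         torch.zeros(N - m, CONTENT)],    # near-zero content
                        dim=-1)
        return torch.cat([object_slots, pad], dim=0)      # always (N, ADDR+CONTENT)

    addr_bank = torch.randn(N, ADDR)                      # learned in practice
    padded = pad_slots(torch.randn(3, ADDR + CONTENT), addr_bank)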

Circularity Check

0 steps flagged

No significant circularity in architectural proposal or empirical claims

full rationale

The paper proposes OA-WAM as an architectural extension to world action models, defining slot states with persistent address vectors and content vectors, then enforcing separation via address-only attention keys and per-layer address resets. It reports empirical results on external benchmarks (LIBERO, SimplerEnv) and a newly introduced causal slot-intervention test with a swap-binding cosine metric. No derivation chain reduces any claimed result to its inputs by construction: the performance numbers and cosine value are measured outcomes, not algebraic identities or refitted parameters renamed as predictions. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and the metric is presented as an independent diagnostic rather than tautological. The central claim therefore rests on observable benchmark behavior and comparative testing rather than logical self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The approach assumes that object identity can be factored into a persistent address and time-varying content without additional losses or supervision; this factorization is introduced by the paper rather than derived from prior results.

free parameters (1)
  • N (number of object slots)
    Chosen to match expected scene complexity; value not stated in abstract but required for the decomposition.
axioms (1)
  • domain assumption · Object identity remains factorizable into address and content across scene perturbations
    Invoked to justify the slot design and intervention test.
invented entities (2)
  • Persistent address vector · no independent evidence
    purpose: To identify which object to act on independently of its current visual state
    New representational primitive introduced to solve entanglement problem
  • Time-varying content vector · no independent evidence
    purpose: To capture current state of each addressed object
    Paired with address vector to separate identity from appearance

pith-pipeline@v0.9.0 · 5629 in / 1479 out tokens · 38451 ms · 2026-05-08T08:47:56.032206+00:00 · methodology


Reference graph

Works this paper leans on

102 extracted references · 100 canonical work pages · 48 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Focusing on what matters: Object-Agent-centric Tokenization for Vision Language Action models

    Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, and Pietro Mazzaglia. Focusing on what matters: Object-Agent-centric Tokenization for Vision Language Action models.arXiv preprint arXiv:2509.23655, 2025

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025. arXiv:2410.24164

  4. [4]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amael Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024

  5. [5]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  6. [6]

    MONet: Unsupervised Scene Decomposition and Representation

    Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. MONet: Unsupervised scene decomposition and representation.arXiv preprint arXiv:1901.11390, 2019

  7. [7]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment Anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021. arXiv:2104.14294

  9. [9]

    RynnVLA-002: A Unified Vision-Language-Action and World Model

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. RynnVLA-002: A unified Vision-Language-Action and world model.arXiv preprint arXiv:2511.17502, 2025

  10. [10]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  11. [11]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  12. [12]

    STORM: Slot-based Task-aware Object-centric Representation for Robotic Manipulation

    Alexandre Chapin, Emmanuel Dellandréa, and Liming Chen. STORM: Slot-based task-aware object-centric representation for robotic manipulation. arXiv preprint arXiv:2601.20381, 2026

  13. [13]

    Goal-VLA: Image-generative VLMs as Object-centric World Models Empowering Zero-shot Robot Manipulation

    Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, and Lin Shao. Goal-VLA: Image-generative VLMs as object-centric world models empowering zero-shot robot manipulation.arXiv preprint arXiv:2506.23919, 2025

  14. [14]

    InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, et al. InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy.arXiv preprint arXiv:2510.13778, 2025

  15. [15]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023. arXiv:2303.04137

  16. [16]

    Learning Universal Policies via Text-Guided Video Generation

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2302.00111

  17. [17]

    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos

    Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, and Thomas Kipf. SAVi++: Towards end-to-end object-centric learning from real-world videos. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2206.07764

  18. [18]

    GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations

    Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, and Ingmar Posner. GENESIS: Generative scene inference and sampling with object-centric latent representations. In International Conference on Learning Representations (ICLR), 2020. arXiv:1907.13052

  19. [19]

    LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth robustness analysis of Vision-Language-Action models. arXiv preprint arXiv:2510.13626, 2025

  20. [20]

    FOCUS: Object-Centric World Models for Robotics Manipulation

    Stefano Ferraro, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. FOCUS: Object-centric world models for robotics manipulation.arXiv preprint arXiv:2307.02427, 2023

  21. [21]

    NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

    Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, and George Konidaris. NovaPlan: Zero-shot long-horizon manipulation via closed-loop video language planning. arXiv preprint arXiv:2602.20119, 2026

  22. [22]

    Multi-Object Representation Learning with Iterative Variational Inference

    Klaus Greff, Raphael Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. InInternational Conference on Machine Learning (ICML), 2019. arXiv:1903.00450

  23. [23]

    On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

    Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, et al. On robustness of Vision-Language-Action model against multi-modal perturbations.arXiv preprint arXiv:2510.00037, 2025

  24. [24]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.Nature, 2025. arXiv:2301.04104

  25. [25]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2310.16828

  26. [26]

    SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

    Taisei Hanyu, Nhat Chung, Huy Le, Toan Nguyen, Yuki Ikebe, Anthony Gunderman, Duy Ho Minh Nguyen, Khoa Vo, Tung Kieu, Kashu Yamazaki, et al. SlotVLA: Towards modeling of object-relation representations in robotic manipulation. arXiv preprint arXiv:2511.06754, 2025

  27. [27]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. ThinkAct: Vision-language-action reasoning via reinforced visual latent planning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2507.16815

  28. [28]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. NORA: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854, 2025

  29. [29]

    Object-Centric World Model for Language-Guided Manipulation

    Youngjoon Jeong, Junha Chun, Soonwoo Cha, and Taesup Kim. Object-centric world model for language-guided manipulation. arXiv preprint arXiv:2503.06170, 2025

  30. [30]

    VIMA: General Robot Manipulation with Multimodal Prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: General robot manipulation with multimodal prompts. InInternational Conference on Machine Learning (ICML), 2023. arXiv:2210.03094

  31. [31]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems (RSS), 2024. arXiv:2403.12945

  32. [32]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source Vision-Language-Action model. arXiv preprint arXiv:2406.09246, 2024

  33. [33]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning Vision-Language-Action models: Optimizing speed and success. InRobotics: Science and Systems (RSS), 2025. arXiv:2502.19645

  34. [34]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, et al. Cosmos Policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  35. [35]

    Conditional Object-Centric Learning from Video

    Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, and Klaus Greff. Conditional object-centric learning from video. InInternational Conference on Learning Representations (ICLR), 2022. arXiv:2111.12594

  36. [36]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023. arXiv:2304.02643

  37. [37]

    What Matters When Building Vision-Language Models?

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024

  38. [38]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  39. [39]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  40. [40]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. InConference on Robot Learning (CoRL), 2024. arXiv:2405.05941

  41. [41]

    ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

    Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, and Shanghang Zhang. ManipDreamer: Boosting robotic manipulation world model with action tree and visual guidance.arXiv preprint arXiv:2504.16464, 2025

  42. [42]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, et al. Genie Envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

  43. [43]

    HoloBrain-0 Technical Report

    Xuewu Lin, Tianwei Lin, Yun Du, Hongyu Xie, Yiwei Jin, Jiawei Li, Shijie Wu, Qingze Wang, Mengdi Li, Mengao Zhao, Ziang Li, Chaodong Huang, Hongzhe Bi, Lichao Huang, and Zhizhong Su. HoloBrain-0 technical report.arXiv preprint arXiv:2602.12062, 2026

  44. [44]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. arXiv:2210.02747

  45. [45]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2306.03310

  46. [46]

    Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

    Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024

  47. [47]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2304.08485

  48. [48]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  49. [49]

    World action verifier: Self-improving world models via forward-inverse asymmetry

    Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. arXiv preprint arXiv:2604.01985, 2026

  50. [50]

    Object-centric learning with slot attention

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.15055

  51. [51]

    Being-H0.7: A Latent World-Action Model from Egocentric Videos

    Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu, Chaoyi Xu, Ziheng Xi, Yuhui Fu, and Zongqing Lu. Being-H0.7: A latent world-action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

  52. [52]

    F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, et al. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025

  53. [53]

    Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

    Xiangkai Ma, Lekai Xing, Han Zhang, Wenzhong Li, and Sanglu Lu. Unifying perception and action: A hybrid-modality pipeline with implicit visual chain-of-thought for robotic action generation.arXiv preprint arXiv:2511.19859, 2025

  54. [54]

    SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels

    Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, and Sven Behnke. SOLD: Slot object-centric latent dynamics models for relational manipulation learning from pixels.arXiv preprint arXiv:2410.08822, 2024

  55. [55]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

  56. [56]

    GR00T N1.5: An improved open foundation model for generalist humanoid robots

    NVIDIA GEAR Team. GR00T N1.5: An improved open foundation model for generalist humanoid robots. NVIDIA Research Blog, June 2025. https://research.nvidia.com/labs/gear/gr00t-n1_5/

  57. [57]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. InRobotics: Science and Systems (RSS), 2024. arXiv:2405.12213

  58. [58]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023

  59. [59]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. arXiv:2304.07193

  60. [60]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a Vision-Language-Action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  61. [61]

    THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, and Dieter Fox. THE COLOSSEUM: A benchmark for evaluating generalization for robotic manipulation.arXiv preprint arXiv:2402.08191, 2024

  62. [62]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, Jiayuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. SpatialVLA: Exploring spatial representations for Visual-Language-Action model. arXiv preprint arXiv:2501.15830, 2025

  63. [63]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021. arXiv:2103.00020

  64. [64]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. arXiv:1910.10683

  65. [65]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment Anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  66. [66]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, et al. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

  67. [67]

    CLIPort: What and Where Pathways for Robotic Manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. InConference on Robot Learning (CoRL), 2021. arXiv:2109.12098

  68. [68]

    Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning (CoRL), 2022. arXiv:2209.05451

  69. [69]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. SmolVLA: A Vision-Language-Action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  70. [70]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  71. [71]

    Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos

    Gautam Singh, Yi-Fu Wu, and Sungjin Ahn. Simple unsupervised object-centric learning for complex and naturalistic videos. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2205.14065

  72. [72]

    World Guidance: World Modeling in Condition Space for Action Generation

    Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World Guidance: World modeling in condition space for action generation.arXiv preprint arXiv:2602.22010, 2026

  73. [73]

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, et al. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  74. [74]

    Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding

    Khoa Vo, Taisei Hanyu, Yuki Ikebe, Trong Thang Pham, Nhat Chung, Minh Nhat Vu, Duy Ho Minh Nguyen, Anh Nguyen, Anthony Gunderman, Chase Rainwater, and Ngan Le. Clutter-robust Vision-Language-Action models through object-centric and geometry grounding. arXiv preprint arXiv:2512.22519, 2025

  75. [75]

    BridgeData V2: A Dataset for Robot Learning at Scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. InConference on Robot Learning (CoRL), 2023. arXiv:2308.12952

  76. [76]

    LIBERO-X: Robustness litmus for vision-language-action models

    Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai, Junjie Liu, and Xinmin Liu. LIBERO-X: Robustness litmus for Vision-Language-Action models.arXiv preprint arXiv:2602.06556, 2026

  77. [77]

    VGGT: Visual Geometry Grounded Transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.11651

  78. [78]

    OmniTokenizer: A joint image-video tokenizer for visual generation

    Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. OmniTokenizer: A joint image-video tokenizer for visual generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.09399

  79. [79]

    Unified Vision-Language-Action Model

    Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified Vision-Language-Action model.arXiv preprint arXiv:2506.19850, 2025

  80. [80]

    FoundationPose: Unified 6D pose estimation and tracking of novel objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2312.08344

Showing first 80 references.