RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang; Qihang Zhang; Shuai Yang; Yiming Luo; Yinghao Xu; Yu-Gang Jiang; Yujun Shen; Zuxuan Wu

arxiv: 2606.13674 · v1 · pith:OOEKMESGnew · submitted 2026-06-11 · 💻 cs.CV

Pith reviewed 2026-06-27 06:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords: world action models, visual-action tokenizers, robot manipulation, representation learning, instruction following, closed-loop control, semantic tokenization, latent actions

The pith

A semantic visual-action tokenizer improves world action models by aligning visuals with latent actions for better robot instruction following.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that world action models benefit from training a representation visual-action tokenizer that produces aligned visual and latent action tokens rather than relying on pixel-reconstruction tokenizers from video models. This change supplies direct semantic guidance for jointly predicting future visual states and the actions that connect them under language instructions. The resulting model is pretrained on this joint objective and then adapted to real robot trajectories, yielding strong closed-loop performance on manipulation tasks. A sympathetic reader would care because pixel fidelity alone leaves dynamics learning under-constrained for control, while the new tokenization ties prediction more tightly to actionable outcomes.

Core claim

RepWAM is built on a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens; the model is then pretrained to jointly model future visual states and the latent actions linking them under language instructions before adaptation to real-robot trajectories for closed-loop manipulation.
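
To make the data flow in the core claim concrete, the toy sketch below tokenizes a short clip into per-frame visual tokens and per-transition latent action tokens, then interleaves them into the kind of joint sequence a world action model could be trained to predict. Everything in it is an assumption made for illustration: the abstract does not describe the tokenizer architecture, its codebooks, or its losses, so `nearest_code`, `tokenize_clip`, `interleave`, the nearest-neighbour quantization, and the use of frame-feature differences as a stand-in for latent actions are all hypothetical and should not be read as RepWAM's method.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize_clip(
    frames: Sequence[Vector],
    visual_codebook: Sequence[Vector],
    action_codebook: Sequence[Vector],
) -> Tuple[List[int], List[int]]:
    """Hypothetical tokenizer: one visual token per frame and one latent
    action token per transition between consecutive frames."""
    visual_tokens = [nearest_code(f, visual_codebook) for f in frames]
    action_tokens = []
    for prev, nxt in zip(frames, frames[1:]):
        # Assumption: the change in frame features stands in for the latent action.
        delta = [b - a for a, b in zip(prev, nxt)]
        action_tokens.append(nearest_code(delta, action_codebook))
    return visual_tokens, action_tokens


def interleave(visual_tokens: List[int], action_tokens: List[int]) -> List[Tuple[str, int]]:
    """Build the joint sequence v0, a0, v1, a1, ..., vT."""
    seq: List[Tuple[str, int]] = []
    for i, v in enumerate(visual_tokens):
        seq.append(("v", v))
        if i < len(action_tokens):
            seq.append(("a", action_tokens[i]))
    return seq


if __name__ == "__main__":
    # Made-up 2-D "features" for three frames; real features would come from an encoder.
    frames = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    visual_codebook = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    action_codebook = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    v, a = tokenize_clip(frames, visual_codebook, action_codebook)
    print(interleave(v, a))
```

The point of the interleaved layout is that each latent action token sits between the two visual states it connects, which is one simple way to read "aligned" visual and action tokens; the paper may realize the alignment differently.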

What carries the argument

The representation visual-action tokenizer, which maps visual inputs into aligned visual and latent action tokens to supply semantic guidance instead of pixel reconstruction.

If this is right

The model achieves strong performance across diverse real-world manipulation tasks and simulation benchmarks.
Ablations confirm that semantic visual-action tokenization outperforms reconstruction-oriented alternatives for dynamics learning.
Representation visual-action tokenization serves as a foundation for world action models that connect prediction to control.
The approach constitutes a step toward generalist robot policies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tokenizer design could be tested on non-manipulation embodied tasks such as navigation or tool use to check whether the alignment benefit transfers.
Replacing reconstruction losses with action-token alignment might reduce the data volume needed for effective pretraining of world models.
Combining the latent action tokens with additional sensory streams such as force or audio could further tighten the link between prediction and control.

Load-bearing premise

Training a representation visual-action tokenizer to map visual inputs into aligned visual and latent action tokens provides substantially better guidance for learning instruction-following dynamics than pixel-reconstruction tokenizers.

What would settle it

If an ablation that uses the same pretraining and adaptation pipeline but swaps in a standard pixel-reconstruction tokenizer achieves equal or higher success rates on the real-world manipulation tasks and simulation benchmarks, the claimed advantage of semantic visual-action tokenization would be refuted.
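
Whether two success rates count as "equal" depends on how many trials back each number, since real-robot evaluations are often run with few rollouts. The sketch below shows one common way to frame the comparison, a pooled two-proportion z-test on success counts. It is an illustration only: the function names are hypothetical, the counts in the demo are placeholders invented for the example and are not results from the paper, and the paper's own evaluation protocol may differ.

```python
import math


def two_proportion_z(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """z-statistic for the difference in success rates (A minus B),
    using the pooled-variance standard error."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("trial counts must be positive")
    p_a = succ_a / n_a
    p_b = succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both arms succeeded (or failed) on every trial: no measurable difference.
        return 0.0
    return (p_a - p_b) / se


def one_sided_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Placeholder counts, NOT from the paper:
    # arm A = semantic tokenizer, arm B = pixel-reconstruction tokenizer.
    z = two_proportion_z(succ_a=40, n_a=50, succ_b=30, n_b=50)
    print(f"z = {z:.3f}, one-sided p = {one_sided_p_value(z):.4f}")
```

A small or negative z under matched pipelines would point toward the outcome described above; a large positive z would support the tokenizer claim, subject to the usual caveats about task selection and multiple comparisons.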

Original abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RepWAM replaces reconstruction tokenizers with one that aligns visual and latent action tokens for joint state-action modeling in world action pretraining.

The core idea here is a tokenizer that maps visual inputs to paired visual and latent action tokens, then pretrains the world model to predict both future visuals and the actions that produce them under language instructions before fine-tuning on robot data.

This is a direct response to the limitation that pixel-reconstruction tokenizers from video models give weak signals for control-relevant dynamics. The paper shows the new approach on real manipulation tasks and sim benchmarks, with ablations indicating the semantic alignment helps over standard alternatives. Releasing code and weights is a clear positive for anyone wanting to test it.

The design choice makes logical sense and targets the stated problem without obvious circularity. The experiments appear to control for the tokenizer type, which strengthens the comparison.

Details on the exact tokenizer architecture, loss terms, and specific performance numbers are not in the abstract, so the size of the gains and whether other factors drive them remain open until the full methods are checked. If the ablations are thorough, this stays a minor issue.

The work is for researchers building world models or tokenizers for robot policies. Anyone working on embodied AI or action-conditioned prediction would find the concrete alternative useful.

It should go to peer review. The motivation is coherent, the experiments cover relevant settings, and the open release allows direct checking.

Referee Report

0 major / 1 minor

Summary. The paper introduces RepWAM, a representation-centric world action model (WAM) that trains a semantic visual-action tokenizer to produce aligned visual and latent action tokens from visual inputs. The WAM is pretrained to jointly model future visual states and the connecting latent actions under language instructions, then adapted to real robot trajectories for closed-loop manipulation. The central empirical claim is that this yields strong performance on real-world and simulated manipulation tasks, with ablations demonstrating the superiority of semantic visual-action tokenization over reconstruction-oriented video tokenizers inherited from generation models.

Significance. If the performance claims and ablation results hold under detailed scrutiny, the work addresses a coherent limitation in existing WAMs (limited dynamics signal from pixel reconstruction) with a targeted alternative, potentially advancing generalist robot policies. The explicit commitment to release code and weights supports reproducibility and follow-on work.

minor comments (1)

Abstract: the claims of 'strong performance' and 'ablations highlight the value' are stated without any quantitative metrics, baseline names, or effect sizes, which is standard for an abstract but leaves the magnitude of gains unassessable from the provided text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of RepWAM and for noting its potential contribution to world action models. The recommendation is listed as uncertain, yet the report contains no enumerated major comments. We therefore provide no point-by-point responses below and have no standing objections.

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical architecture for training representation visual-action tokenizers, pretraining a world action model on future visual states and latent actions, and adapting to robot trajectories, with performance validated via real-world and simulation experiments plus ablations. No equations, derivations, or load-bearing claims appear in the abstract or described content that reduce by construction to author-defined inputs, fitted parameters renamed as predictions, or self-citation chains. The central motivation (pixel reconstruction provides limited dynamics signal) is addressed by a direct alternative design, with results presented as experimental outcomes rather than logical necessities. This matches the most common honest finding for self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central claim rests on the unverified premise that semantic tokenization improves dynamics learning.

pith-pipeline@v0.9.1-grok · 5757 in / 1020 out tokens · 13799 ms · 2026-06-27T06:45:33.178913+00:00 · methodology
