ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
Pith reviewed 2026-05-07 06:18 UTC · model grok-4.3
The pith
ExoActor synthesizes third-person videos of task execution from instructions and scene context, then converts them into generalizable humanoid robot behaviors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a task instruction and scene context, ExoActor synthesizes plausible execution processes that implicitly encode coordinated interactions between robot, environment, and objects. This video output is then transformed into executable humanoid behaviors through a pipeline that estimates human motion and executes it via a general motion controller, yielding a task-conditioned behavior sequence. The framework is implemented as an end-to-end system that demonstrates generalization to new scenarios without additional real-world data collection.
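The three-stage pipeline described above can be sketched as a data flow. All class and function names below are hypothetical placeholders for illustration; they are not the authors' actual API, and each stage is stubbed rather than calling a real model.

```python
from dataclasses import dataclass

# Hypothetical interfaces for the pipeline stages in the core claim:
# video synthesis -> human motion estimation -> general motion control.
# Stubs only; the real system would wrap a video diffusion model,
# a pose estimator, and a whole-body controller.

@dataclass
class Behavior:
    """A task-conditioned sequence of joint targets for the humanoid."""
    joint_targets: list  # one entry per control step

def synthesize_video(instruction: str, scene: str) -> list:
    """Stage 1: render an exocentric execution video as frames (stubbed)."""
    return [f"frame_{t}" for t in range(3)]

def estimate_motion(frames: list) -> list:
    """Stage 2: recover a human motion trajectory from the frames (stubbed)."""
    return [{"frame": f, "pose": None} for f in frames]

def execute(trajectory: list) -> Behavior:
    """Stage 3: a general motion controller tracks the estimated trajectory."""
    return Behavior(joint_targets=[step["pose"] for step in trajectory])

def exoactor_pipeline(instruction: str, scene: str) -> Behavior:
    frames = synthesize_video(instruction, scene)
    trajectory = estimate_motion(frames)
    return execute(trajectory)

behavior = exoactor_pipeline("pick up the cup", "kitchen table scene")
```

The point of the sketch is the interface, not the internals: the only coupling between stages is the generated video and the estimated trajectory, which is what makes the video model swappable.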
What carries the argument
Exocentric video generation as a unified interface that synthesizes videos implicitly encoding robot-environment-object interaction dynamics, which are then decoded into executable sequences by a motion estimation and general controller pipeline.
If this is right
- Humanoid systems can produce coordinated, interaction-rich behaviors for new tasks using only video synthesis from instructions and scene context.
- No additional real-world data collection or fine-tuning is required for generalization to unseen scenarios.
- Video generation models can serve as a scalable prior for modeling the joint spatial-temporal-intentional aspects of robot actions.
- The same pipeline yields task-conditioned behavior sequences that a general motion controller can execute on hardware.
Where Pith is reading between the lines
- If video models become better at respecting physical constraints, the transfer step could require fewer safety filters or corrections on the robot.
- Adding real-time visual feedback from the robot's cameras into the video synthesis loop might reduce mismatches between generated and actual dynamics.
- The approach could extend to non-humanoid platforms if their motion controllers accept similar estimated trajectories derived from video.
- Current limits in video generation quality, such as inconsistent object physics, would directly limit success rates on contact-heavy tasks.
Load-bearing premise
The generated videos must accurately represent physically feasible interaction dynamics that can be transferred to real humanoid hardware through motion estimation and a general controller.
What would settle it
Run the full pipeline on a physical humanoid in a novel scene with unseen objects, then check whether the extracted behaviors complete the instructed task without falling, colliding, or deviating due to physically implausible dynamics in the synthesized video.
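The settling experiment above amounts to a pass/fail check over rollout telemetry. A minimal sketch of that check, assuming a per-step log with torso height, unplanned-contact flags, and trajectory-tracking error; the field names and thresholds are illustrative assumptions, not values from the paper.

```python
# Hypothetical success criterion for one hardware or simulation rollout:
# the instructed task completes with no fall, no unplanned collision,
# and no large deviation from the estimated trajectory.

FALL_HEIGHT_M = 0.5      # torso below this height counts as a fall (assumed)
MAX_DEVIATION_M = 0.3    # tolerated drift from the reference motion (assumed)

def rollout_succeeds(log: dict) -> bool:
    """log: per-step telemetry collected during a rollout."""
    fell = any(h < FALL_HEIGHT_M for h in log["torso_heights"])
    collided = any(log["unplanned_contacts"])
    deviated = max(log["tracking_errors"]) > MAX_DEVIATION_M
    return log["task_completed"] and not (fell or collided or deviated)

example = {
    "torso_heights": [0.90, 0.88, 0.91],
    "unplanned_contacts": [False, False, False],
    "tracking_errors": [0.05, 0.12, 0.08],
    "task_completed": True,
}
print(rollout_succeeds(example))  # True
```

Aggregating this boolean over many novel scenes and unseen objects yields exactly the task success rate the referee report below asks for.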
Original abstract
Humanoid control systems have made significant progress in recent years, yet modeling fluent interaction-rich behavior between a robot, its surrounding environment, and task-relevant objects remains a fundamental challenge. This difficulty arises from the need to jointly capture spatial context, temporal dynamics, robot actions, and task intent at scale, which is a poor match to conventional supervision. We propose ExoActor, a novel framework that leverages the generalization capabilities of large-scale video generation models to address this problem. The key insight in ExoActor is to use third-person video generation as a unified interface for modeling interaction dynamics. Given a task instruction and scene context, ExoActor synthesizes plausible execution processes that implicitly encode coordinated interactions between robot, environment, and objects. Such video output is then transformed into executable humanoid behaviors through a pipeline that estimates human motion and executes it via a general motion controller, yielding a task-conditioned behavior sequence. To validate the proposed framework, we implement it as an end-to-end system and demonstrate its generalization to new scenarios without additional real-world data collection. Furthermore, we conclude by discussing limitations of the current implementation and outlining promising directions for future research, illustrating how ExoActor provides a scalable approach to modeling interaction-rich humanoid behaviors, potentially opening a new avenue for generative models to advance general-purpose humanoid intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ExoActor, a framework that leverages large-scale exocentric video generation models as a unified interface for modeling interaction-rich humanoid behaviors. Given a task instruction and scene context, the system synthesizes videos that implicitly encode coordinated robot-environment-object dynamics; these videos are then converted into executable humanoid control sequences via human motion estimation followed by a general motion controller. The authors claim to have implemented this as an end-to-end system that generalizes to novel scenarios without additional real-world data collection or fine-tuning, while also discussing current limitations and future directions.
Significance. If the central claims are substantiated, the work could be significant for the field of humanoid robotics. It offers a potentially scalable alternative to conventional data-driven supervision by repurposing the generalization capabilities of pre-trained video diffusion models to capture complex, interaction-rich dynamics. This approach could reduce reliance on task-specific real-world datasets and open a new pathway for generative models to contribute to general-purpose humanoid control, provided the physical feasibility and transferability issues are resolved.
major comments (2)
- [Abstract] Abstract: The manuscript asserts that an end-to-end system was implemented and that generalization to new scenarios was demonstrated without additional real-world data collection. However, no quantitative metrics (e.g., task success rates, trajectory tracking errors, force/torque violations), baselines, ablation studies, or failure-mode analysis are reported. This absence directly undermines evaluation of the central claim that generated videos encode physically feasible dynamics transferable to hardware.
- [Abstract] Pipeline description (Abstract and implied method): The video-to-motion-estimation-to-controller pipeline assumes that a general motion controller can absorb residual non-physical artifacts (e.g., object penetrations, inconsistent velocities, hallucinated contacts) produced by video diffusion models. No characterization of the controller's error tolerance, contact modeling, or sim-to-real gap is provided, nor is any filtering or correction mechanism described. This assumption is load-bearing for the claim of executable, stable behaviors on real humanoid hardware.
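One concrete example of the filtering mechanism the second comment finds missing is a velocity-feasibility gate on the estimated trajectory: reject motions whose frame-to-frame joint velocities exceed a physical limit, a common symptom of video-generation artifacts. This is a sketch under assumed data layout and an assumed actuator limit, not anything described in the manuscript.

```python
# Hypothetical pre-controller filter: flag estimated trajectories with
# physically implausible joint-velocity jumps between frames.

MAX_JOINT_SPEED = 6.0  # rad/s, a rough humanoid actuator limit (assumed)

def velocity_feasible(trajectory, dt=1 / 30):
    """trajectory: list of per-frame joint-angle lists, in radians,
    sampled at interval dt (seconds)."""
    for prev, curr in zip(trajectory, trajectory[1:]):
        for a, b in zip(prev, curr):
            if abs(b - a) / dt > MAX_JOINT_SPEED:
                return False  # implausible jump between consecutive frames
    return True

smooth = [[0.00, 0.10], [0.05, 0.15], [0.10, 0.20]]   # ~1.5 rad/s
jumpy = [[0.00, 0.10], [1.50, 0.15]]                  # ~45 rad/s on joint 0
print(velocity_feasible(smooth), velocity_feasible(jumpy))  # True False
```

Analogous gates on object penetration depth and contact consistency would give the controller a characterized input domain rather than raw pose estimates.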
minor comments (1)
- [Abstract] Abstract: The motion estimation step is described only at a high level ('estimates human motion'); naming the specific pose tracker or model employed would improve reproducibility and clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, acknowledging where the current submission falls short and outlining planned revisions to strengthen the presentation of our framework.
Point-by-point responses
Referee: [Abstract] Abstract: The manuscript asserts that an end-to-end system was implemented and that generalization to new scenarios was demonstrated without additional real-world data collection. However, no quantitative metrics (e.g., task success rates, trajectory tracking errors, force/torque violations), baselines, ablation studies, or failure-mode analysis are reported. This absence directly undermines evaluation of the central claim that generated videos encode physically feasible dynamics transferable to hardware.
Authors: We agree that the absence of quantitative metrics, baselines, and failure-mode analysis limits the ability to rigorously evaluate the claims of physical feasibility and hardware transferability. The current manuscript emphasizes the conceptual pipeline and provides qualitative demonstrations of video synthesis and motion transfer to novel scenarios, but does not include numerical results or comparative studies. This reflects the preliminary nature of the implementation in the initial submission. In the revised manuscript, we will add a dedicated evaluation section that reports task success rates in simulation, trajectory tracking metrics where feasible, and an analysis of observed failure modes (such as contact inconsistencies). We will also revise the abstract and claims to more precisely describe the scope of the generalization demonstration as qualitative transfer without fine-tuning, rather than implying comprehensive hardware validation. Revision status: partial.
Referee: [Abstract] Pipeline description (Abstract and implied method): The video-to-motion-estimation-to-controller pipeline assumes that a general motion controller can absorb residual non-physical artifacts (e.g., object penetrations, inconsistent velocities, hallucinated contacts) produced by video diffusion models. No characterization of the controller's error tolerance, contact modeling, or sim-to-real gap is provided, nor is any filtering or correction mechanism described. This assumption is load-bearing for the claim of executable, stable behaviors on real humanoid hardware.
Authors: The referee accurately identifies a core assumption underlying the pipeline. The manuscript describes the conversion from generated video to motion estimates and then to a general motion controller, but provides limited detail on the controller's robustness to video artifacts or its contact modeling. We will revise the methods section to include a characterization of the controller's error tolerance, its handling of contacts and velocities, and a discussion of the sim-to-real considerations. We will also explicitly note the lack of dedicated filtering or correction mechanisms as a current limitation and outline it as a direction for future work. These additions will clarify the load-bearing assumptions and better contextualize the executability claims. Revision status: yes.
Circularity Check
No significant circularity; the framework delegates to external pre-trained models.
Full rationale
The paper describes a high-level pipeline that uses pre-trained video generation models to synthesize exocentric videos from task instructions and scene context, followed by motion estimation and a general controller to produce humanoid behaviors. No equations, fitted parameters, or first-principles derivations appear in the provided text. Generalization is asserted via empirical implementation and demonstration on new scenarios, without reducing any prediction or result to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The central claim rests on the capabilities of external models rather than internal self-referential logic, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large-scale video generation models can synthesize videos that implicitly encode coordinated, physically plausible interactions between a humanoid, its environment, and task objects.