Co-Evolving Latent Action World Models

De-Chuan Zhan; Fengming Zhang; Jiang Bian; Kaixin Wang; Li Zhao; Yucen Wang

arxiv: 2510.26433 · v2 · submitted 2025-10-30 · 💻 cs.LG

Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian

Pith reviewed 2026-05-18 02:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords latent action model · world model · joint training · warm-up phase · representation alignment · video simulation · visual planning · co-evolution

The pith

A warm-up phase aligns latent action models with pretrained world models to enable stable joint training and co-evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt pretrained video models into controllable world models by training the latent action model and world model together instead of in two separate stages. The joint approach risks representational collapse, but a dedicated warm-up phase first aligns the randomly initialized action model with the fixed world model. This alignment starts a cycle where the world model supplies gradients to refine the action model while the action model supplies more precise controls back to the world model. The result is video simulation and visual planning performance that matches or exceeds prior separate-training methods. If the approach holds, training pipelines for generalist world models could become both simpler and more effective.
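The two-phase schedule described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: the scalar "world model" weight `w`, the scalar "action model" weight `theta`, the squared-error loss, the learning rate, and the step counts are all hypothetical stand-ins chosen to make the gradient flow visible. During warm-up only `theta` is updated while `w` stays frozen; afterwards both are updated from the same loss.

```python
# Toy sketch (assumed setup, not the paper's architecture):
# latent action = theta * x, prediction = w * action, loss = (prediction - y)^2


def loss_and_grads(w, theta, x, y):
    action = theta * x                # stand-in latent action from the action model
    pred = w * action                 # stand-in world-model prediction
    err = pred - y
    loss = err * err
    grad_w = 2.0 * err * action       # d loss / d w
    grad_theta = 2.0 * err * w * x    # d loss / d theta, flows through the world model
    return loss, grad_w, grad_theta


def train(w, theta, data, warmup_steps, joint_steps, lr=0.05):
    history = []
    for step in range(warmup_steps + joint_steps):
        x, y = data[step % len(data)]
        loss, grad_w, grad_theta = loss_and_grads(w, theta, x, y)
        history.append(loss)
        theta -= lr * grad_theta      # the action model is trained in both phases
        if step >= warmup_steps:      # the world model is frozen during warm-up
            w -= lr * grad_w
    return w, theta, history


data = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)]   # hypothetical targets following y = 2 * x
w_pre = 1.5                                    # hypothetical "pretrained" world-model weight

# Warm-up only: the world-model weight must come back untouched.
w_after_warmup, theta_after_warmup, _ = train(w_pre, 0.1, data, warmup_steps=50, joint_steps=0)

# Warm-up followed by joint training of both weights.
w_final, theta_final, history = train(w_pre, 0.1, data, warmup_steps=50, joint_steps=50)
```

In this toy setting the action-model weight adapts to the fixed world-model weight before the two are allowed to move together, which is the ordering the summary attributes to the warm-up phase.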

Core claim

CoLA-World realizes the synergistic paradigm of jointly training a latent action model and a world model by using a critical warm-up phase that aligns their representations, unlocking a co-evolution cycle in which the world model shapes a high-quality latent action model while the latent action model provides a more precise control interface.

What carries the argument

The warm-up phase that aligns representations of the from-scratch latent action model with the pretrained world model to prevent representational collapse and enable stable beneficial co-adaptation.

Load-bearing premise

A dedicated warm-up phase can reliably align representations between a randomly initialized latent action model and a pretrained world model sufficiently to prevent collapse and enable stable joint training.

What would settle it

Joint training without the warm-up phase produces representational collapse and performance that is no better, or is worse, than separate two-stage training.

Figures

Figures reproduced from arXiv: 2510.26433 by De-Chuan Zhan, Fengming Zhang, Jiang Bian, Kaixin Wang, Li Zhao, Yucen Wang.

**Figure 1.** (a) Prior works use a two-stage pipeline: learn a latent action model (LAM), then fix it to train the world model. (b) We propose a one-stage pipeline, directly using the world model as the forward dynamics model and backpropagating gradients through latent actions.

… for action conditioning and world modeling. IRASim [42] uses adaptive layer normalization [28] to incorporate actions, analogous to how text p…

**Figure 2.** Latent action codebook metrics during joint training of the IDM and world model. “rand” …

**Figure 3.** Latent action codebook metrics during warm-up and joint training. Different blue curves …

**Figure 4.** Evidence of synergistic co-evolution. The LAM’s probing loss drops faster when the world …

**Figure 5.** Codebook metrics in different training and adaptation stages. All subplots share the same …

**Figure 6.** Action transfer results. The source and target videos come from different datasets.
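Several of the figures above track latent action codebook metrics, but the captions as reproduced here do not say which metrics are plotted. As a hedged illustration only, the sketch below computes two indicators commonly used to spot collapse in a vector-quantized codebook: the fraction of codes in use and the perplexity of the code-usage distribution. The function name `codebook_metrics` and the example index lists are hypothetical.

```python
import math
from collections import Counter


def codebook_metrics(code_indices, codebook_size):
    """Return (usage fraction, perplexity) for a batch of selected code indices."""
    counts = Counter(code_indices)
    total = len(code_indices)
    usage = len(counts) / codebook_size            # share of codes selected at least once
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    perplexity = math.exp(entropy)                 # effective number of codes in use
    return usage, perplexity


# Made-up assignments: one batch spreads over all four codes, the other uses a single code.
healthy = codebook_metrics([0, 1, 2, 3, 0, 1, 2, 3], codebook_size=4)
collapsed = codebook_metrics([2, 2, 2, 2, 2, 2, 2, 2], codebook_size=4)
```

A perplexity near the codebook size means assignments are spread evenly, while a perplexity near 1 means almost every input maps to the same code.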

Original abstract

Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains a latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but this is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pretrained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoLA-World, a framework for jointly training a from-scratch latent action model (LAM) and a pretrained world model. A dedicated warm-up phase first aligns the LAM's representations with those of the world model, which avoids collapse and enables a co-evolution cycle in which the world model tutors the LAM and the LAM provides precise control. The paper positions this as an advance over the dominant two-stage pipeline with separate training, and reports video simulation quality and downstream visual planning performance that match or exceed it.

Significance. If the warm-up mechanism demonstrably enables stable co-adaptation without collapse, the work would meaningfully advance the field by replacing redundant two-stage pipelines with a synergistic joint-training paradigm for controllable world models, potentially improving efficiency and representation quality in video-based planning and simulation tasks.

major comments (3)

[Methods (warm-up phase description)] Methods section on the warm-up procedure: the central claim that this phase 'effectively aligns the representations' and unlocks co-evolution lacks direct supporting measurements (e.g., cosine similarity, mutual information, or latent variance between LAM and world-model latents) tracked before versus after warm-up, leaving the mechanism unverified.
[Section 4 (ablations and baselines)] Experiments and ablations: no ablation is reported that trains without the warm-up phase (or with random initialization throughout) and directly compares collapse indicators or final performance against the full CoLA-World pipeline, which is required to establish that the warm-up is load-bearing rather than incidental to hyperparameter choices or the pretrained checkpoint.
[Results (quantitative tables)] Results tables (e.g., video quality and planning metrics): reported gains over two-stage baselines are presented without error bars across multiple random seeds or statistical significance tests, making it difficult to attribute improvements specifically to the co-evolution cycle versus other implementation details.
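The alignment measurements named in the first major comment can be sketched in a few lines. This is an illustration under assumptions, not a procedure from the paper: it assumes the LAM and world-model latents have already been brought to the same dimensionality (for example by a projection), that no latent is the zero vector, and the function names and example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)    # assumes neither vector has zero norm


def mean_pairwise_cosine(lam_latents, wm_latents):
    """Average cosine similarity between matched LAM and world-model latents."""
    sims = [cosine_similarity(u, v) for u, v in zip(lam_latents, wm_latents)]
    return sum(sims) / len(sims)


def per_dimension_variance(latents):
    """Population variance of each latent dimension; values near zero suggest collapse."""
    n = len(latents)
    dims = len(latents[0])
    means = [sum(z[d] for z in latents) / n for d in range(dims)]
    return [sum((z[d] - means[d]) ** 2 for z in latents) / n for d in range(dims)]


# Made-up latents for illustration only.
aligned = mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
orthogonal = mean_pairwise_cosine([[1.0, 0.0]], [[0.0, 1.0]])
collapsed_var = per_dimension_variance([[0.5, 0.5], [0.5, 0.5]])
```

Logging these quantities at the start and end of warm-up would give the before-versus-after comparison the comment asks for.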

minor comments (2)

[Section 3.3] Notation for the joint loss and gradient flow during co-evolution could be clarified with an explicit equation showing how LAM and world-model parameters are updated in the same step.
[Figure 1] Figure illustrating the training pipeline would benefit from explicit arrows or labels distinguishing the warm-up phase from the subsequent joint training loop.
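One way the explicit equation requested in the Section 3.3 comment could look is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $f_\phi$ denotes the inverse dynamics model of the LAM, $g_\theta$ the world model, $o_t$ an observation, $z_t$ the latent action, $\ell$ a generic prediction loss, and $\eta$ a learning rate. Any quantization or auxiliary codebook terms are omitted.

```latex
% Assumed notation, for illustration only.
\begin{align}
  z_t &= f_\phi(o_t, o_{t+1}), \\
  \mathcal{L}(\theta, \phi) &= \ell\big(g_\theta(o_t, z_t),\; o_{t+1}\big), \\
  \text{warm-up:}\quad \phi &\leftarrow \phi - \eta\, \nabla_\phi \mathcal{L}(\theta, \phi),
    \qquad \theta \text{ held fixed}, \\
  \text{joint:}\quad (\theta, \phi) &\leftarrow (\theta, \phi) - \eta\, \nabla_{(\theta, \phi)} \mathcal{L}(\theta, \phi).
\end{align}
```

In both phases the gradient with respect to $\phi$ passes through $g_\theta$, which is the sense in which the world model supplies gradients to the action model.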

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the empirical support needed for our claims about the warm-up phase and co-evolution. We address each major point below and will revise the manuscript to strengthen the validation of our method.

Point-by-point responses

Referee: [Methods (warm-up phase description)] Methods section on the warm-up procedure: the central claim that this phase 'effectively aligns the representations' and unlocks co-evolution lacks direct supporting measurements (e.g., cosine similarity, mutual information, or latent variance between LAM and world-model latents) tracked before versus after warm-up, leaving the mechanism unverified.

Authors: We agree that direct measurements would provide clearer verification of the alignment effect. In the revised manuscript, we will add analysis in the Methods section (and a new figure) reporting cosine similarity and latent variance between LAM and world-model latents before and after the warm-up phase. These metrics will quantify the alignment achieved and support the mechanism enabling stable co-evolution. revision: yes
Referee: [Section 4 (ablations and baselines)] Experiments and ablations: no ablation is reported that trains without the warm-up phase (or with random initialization throughout) and directly compares collapse indicators or final performance against the full CoLA-World pipeline, which is required to establish that the warm-up is load-bearing rather than incidental to hyperparameter choices or the pretrained checkpoint.

Authors: We acknowledge that an explicit ablation isolating the warm-up is essential to demonstrate its necessity. We will add this ablation to Section 4, training without the warm-up (random initialization throughout) and comparing collapse indicators (e.g., latent divergence metrics) as well as final video quality and planning performance against the full pipeline. This will show that the warm-up is load-bearing for avoiding collapse and achieving the reported results. revision: yes
Referee: [Results (quantitative tables)] Results tables (e.g., video quality and planning metrics): reported gains over two-stage baselines are presented without error bars across multiple random seeds or statistical significance tests, making it difficult to attribute improvements specifically to the co-evolution cycle versus other implementation details.

Authors: We agree that error bars and statistical tests would strengthen attribution of gains to the co-evolution cycle. In the revision, we will rerun the primary experiments across multiple random seeds and report mean and standard deviation in the quantitative tables. We will also include statistical significance tests (e.g., paired t-tests) comparing CoLA-World to the two-stage baselines to better isolate the contribution of joint training. revision: yes
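The multi-seed reporting promised in the last response can be sketched with the Python standard library. The numbers below are placeholders for illustration only and are not results from the paper; the function names are hypothetical. The sketch reports mean and sample standard deviation per method and computes the paired t statistic over per-seed differences; turning that statistic into a p-value would need a t-distribution table or a statistics package.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """Paired t statistic; assumes the per-seed differences are not all identical."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Placeholder per-seed scores (made up, same seeds for both methods).
joint = [0.80, 0.82, 0.79, 0.83, 0.81]
two_stage = [0.78, 0.79, 0.78, 0.80, 0.80]

joint_mean, joint_std = summarize(joint)
t_stat = paired_t_statistic(joint, two_stage)
degrees_of_freedom = len(joint) - 1
```

Pairing by seed removes seed-to-seed variation shared by both methods, which is why a paired test fits this comparison better than an unpaired one.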

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper describes an empirical training procedure consisting of a warm-up phase to align a randomly initialized latent action model with a pretrained world model, followed by joint co-evolution training. No equations, derivations, or parameter-fitting steps are presented that reduce any claimed output (such as improved video quality or planning performance) to a quantity defined by the method itself. The warm-up phase is introduced as an independent procedural intervention rather than a self-referential fit or renamed input. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core mechanism. The central claims rest on reported empirical matches or improvements over two-stage baselines, rendering the argument self-contained against external benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard deep learning assumptions about gradient-based optimization and representation learning in video models, with no explicit free parameters, axioms, or invented entities detailed in the abstract.

axioms (1)

domain assumption: Pretrained video generation models contain useful dynamics representations that can be adapted for controllable world modeling via latent actions.
This is the foundational premise for adapting video models into world models.

pith-pipeline@v0.9.0 · 5734 in / 1248 out tokens · 61924 ms · 2026-05-18T02:48:29.640449+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "naively training the IDM and world model together can easily lead to collapse... warm-up phase in which the world model is kept frozen and only supplies gradients to update the IDM"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Latent State Design for World Models under Sufficiency Constraints
cs.AI · 2026-05 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
cs.RO · 2026-02 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

Reinforcing VLAs in Task-Agnostic World Models
cs.AI · 2026-05 · unverdicted · novelty 6.0

RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

Reinforcing VLAs in Task-Agnostic World Models
cs.AI · 2026-05 · unverdicted · novelty 6.0

RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 3 Pith papers · 11 internal anchors

[1] AgiBot-World-Contributors, Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Huang, X., Jiang, S., Jiang, Y., Jing, C., Li, H., Li, J., Liu, C., Liu, Y., Lu, Y., Luo, J., Luo, P., Mu, Y., Niu, Y., Pan, Y., Pang, J., Qiao, Y., Ren, G., Ruan, C., Shan, J., Shen, Y., Shi, C., Shi, M., Shi, M., Sima, C., Song, J., Wang, H., Wang, W., W... AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems. arXiv, 2025.
[2] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[3] Bruce, J., Dennis, M. D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al. Genie: Generative interactive environments. In International Conference on Machine Learning, pp. 4603–4623. PMLR, 2024.
[4] Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., and Li, H. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
[5] Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11315–11325, June 2022.
[6] Chen, X., Wei, H., Zhang, P., Zhang, C., Wang, K., Guo, Y., Yang, R., Wang, Y., Xiao, X., Zhao, L., Chen, J., and Bian, J. villa-x: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.
[7] Collaboration, O. X.-E., O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., Tung, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Gupta, A., Wang, A., Kolobov, A., Singh, A., Garg, A., Kembhavi, A., Xie, A., Brohan, A., Raffin, A., Sharma, A., Yavary, A., Jain, A., Balak... (arXiv, 2023)
[8] Cui, H. and Gao, Y. A universal world model learned from large scale and diverse videos. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
[9] Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020.
[10] Fang, H.-S., Fang, H., Tang, Z., Liu, J., Wang, J., Zhu, H., and Lu, C. Rh20t: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.
[11] Gao, S., Zhou, S., Du, Y., Zhang, J., and Gan, C. Adaworld: Learning adaptable world models with latent actions. In International Conference on Machine Learning (ICML), 2025.
[12] Goyal, R., Kahou, S. E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. The "something something" video database for learning and evaluating visual common sense, 2017. URL https://arxiv.org/abs/1706.04261.
[13] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., Martin, M., Nagarajan, T., Radosavovic, I., Ramakrishnan, S. K., Ryan, F., Sharma, J., Wray, M., Xu, M., Xu, E. Z., Zhao, C., Bansal, S., Batra, D., Cartillier, V., Crane, S., Do, T., Doulaty, M., Erapalli, A., Feichtenhofer, C., Fragom... (2022)
[14] Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018.
[15] Hafner, D., Lillicrap, T. P., Norouzi, M., and Ba, J. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0oabwyZbOu.
[16] He, H., Zhang, Y., Lin, L., Xu, Z., and Pan, L. Pre-trained video generative models as world simulators. arXiv preprint arXiv:2502.07825, 2025.
[17] Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al. Enerverse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895, 2025.
[18] Huang, S., Wu, J., Zhou, Q., Miao, S., and Long, M. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025.
[19] Jiang, Y., Chen, S., Huang, S., Chen, L., Zhou, P., Liao, Y., He, X., Liu, C., Li, H., Yao, M., et al. Enerverse-ac: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025.
[20] Kannan, H., Hafner, D., Finn, C., and Erhan, D. Robodesk: A multi-task reinforcement learning benchmark. https://github.com/google-research/robodesk, 2021.
[21] Li, Y., Liu, M., and Rehg, J. M. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European conference on computer vision (ECCV), pp. 619–635, 2018.
[22] Li, Y., Cao, Z., Liang, A., Liang, B., Chen, L., Zhao, H., and Feng, C. Egocentric prediction of action target in 3d. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
[23] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023.
[24] Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[25] Liu, Y., Liu, Y., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., and Yi, L. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21013–21022, June 2022.
[26] NVIDIA: Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L. J., Fang, Y., Fox, D., Hu, F., Huang, S., Jang, J., Jiang, Z., Kautz, J., Kundalia, K., Lao, L., Li, Z., Lin, Z., Lin, K., Liu, G., Llontop, E., Magne, L., Mandlekar, A., Narayan, A., Nasiriany, S., Reed, S., Tan, Y. L., Wang, G., Wang, Z., Wang, J., Wang, Q., Xiang, J., Xie... (arXiv, 2025)
[27] OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024. Accessed: 2025-09-18.
[28] Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023.
[29] Rigter, M., Gupta, T., Hilmkil, A., and Ma, C. AVID: Adapting video diffusion models to world models. In Reinforcement Learning Conference, 2025. URL https://openreview.net/forum?id=C18kcGeqAW.
[30] Schmidt, D. and Jiang, M. Learning to act without actions. International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2312.10812.
[31] Sutton, R. S. Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference (1990) on Machine learning, pp. 216–224, 1990.
[32] Tian, S., Finn, C., and Wu, J. A control-centric benchmark for video prediction. arXiv preprint arXiv:2304.13723, 2023.
[33] Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
[34] Wang, J., Zhang, Q., Chao, Y.-W., Wen, B., Guo, X., and Xiang, Y. Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction, 2024. URL https://arxiv.org/abs/2406.06843.
[35] Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., Bohus, D., Feniello, A., Tekin, B., Frujeri, F. V., Joshi, N., and Pollefeys, M. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20270–20281, October 2023.
[36] Wang, Y., Wan, S., Gan, L., Feng, S., and Zhan, D.-C. Ad3: Implicit action is the key for world models to distinguish the diverse visual distractors. International Conference on Machine Learning. doi: 10.48550/arXiv.2403.09976.
[37] Xu, M., Dai, W., Liu, C., Gao, X., Lin, W., Qi, G.-J., and Xiong, H. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
[38] Ye, S., Jang, J., Jeon, B., Joo, S. J., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.-W., Lin, B. Y., Liden, L., Lee, K., Gao, J., Zettlemoyer, L., Fox, D., and Seo, M. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=VYOe2eBQeh.
[39] Ye, W., Zhang, Y., Abbeel, P., and Gao, Y. Become a proficient player with limited data through watching pure videos. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Sy-o2N0hF4f.
[40] Zhang, L., Kan, M., Shan, S., and Chen, X. Prelar: World model pre-training with learnable action representation. In European Conference on Computer Vision, pp. 185–201. Springer, 2024.
[41] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
[42] Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., and Kong, T. Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540, 2024.

[1] [1]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

AgiBot-World-Contributors, Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Huang, X., Jiang, S., Jiang, Y., Jing, C., Li, H., Li, J., Liu, C., Liu, Y., Lu, Y., Luo, J., Luo, P ., Mu, Y., Niu, Y., Pan, Y., Pang, J., Qiao, Y., Ren, G., Ruan, C., Shan, J., Shen, Y., Shi, C., Shi, M., Shi, M., Sima, C., Song, J., Wang, H., Wang, W., W...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V ., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al

Bruce, J., Dennis, M. D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning, pp. 4603–4623. PMLR, 2024

work page 2024

[4] [4]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P ., and Li, H. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11315–11325, June 2022

work page 2022

[6] [6]

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Chen, X., Wei, H., Zhang, P ., Zhang, C., Wang, K., Guo, Y., Yang, R., Wang, Y., Xiao, X., Zhao, L., Chen, J., and Bian, J. villa-x: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv: 2507.23682, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Collaboration, O. X.-E., O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., Tung, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Gupta, A., Wang, A., Kolobov, A., Singh, A., Garg, A., Kembhavi, A., Xie, A., Brohan, A., Raffin, A., Sharma, A., Yavary, A., Jain, A., Balak...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

and Gao, Y

Cui, H. and Gao, Y. A universal world model learned from large scale and diverse videos. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023

work page 2023

[9] [9]

M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al

Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al. The epic-kitchens dataset: Collection, challenges and baselines.IEEE T ransactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

[10]

Fang, H.-S., Fang, H., Tang, Z., Liu, J., Wang, J., Zhu, H., and Lu, C. Rh20t: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

[11]

Gao, S., Zhou, S., Du, Y., Zhang, J., and Gan, C. Adaworld: Learning adaptable world models with latent actions. In International Conference on Machine Learning (ICML), 2025

[12]

Goyal, R., Kahou, S. E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. The "something something" video database for learning and evaluating visual common sense, 2017. URL https://arxiv.org/abs/1706.04261

[13]

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., Martin, M., Nagarajan, T., Radosavovic, I., Ramakrishnan, S. K., Ryan, F., Sharma, J., Wray, M., Xu, M., Xu, E. Z., Zhao, C., Bansal, S., Batra, D., Cartillier, V., Crane, S., Do, T., Doulaty, M., Erapalli, A., Feichtenhofer, C., Fragom...

[14]

Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018

[15]

Hafner, D., Lillicrap, T. P., Norouzi, M., and Ba, J. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0oabwyZbOu

[16]

He, H., Zhang, Y., Lin, L., Xu, Z., and Pan, L. Pre-trained video generative models as world simulators. arXiv preprint arXiv:2502.07825, 2025

[17]

Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al. Enerverse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895, 2025

[18]

Huang, S., Wu, J., Zhou, Q., Miao, S., and Long, M. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025

[19]

Jiang, Y., Chen, S., Huang, S., Chen, L., Zhou, P., Liao, Y., He, X., Liu, C., Li, H., Yao, M., et al. Enerverse-ac: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025

[20]

Kannan, H., Hafner, D., Finn, C., and Erhan, D. Robodesk: A multi-task reinforcement learning benchmark. https://github.com/google-research/robodesk, 2021

[21]

Li, Y., Liu, M., and Rehg, J. M. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 619–635, 2018

[22]

Li, Y., Cao, Z., Liang, A., Liang, B., Chen, L., Zhao, H., and Feng, C. Egocentric prediction of action target in 3d. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022

[23]

Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023

[24]

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[25]

Liu, Y., Liu, Y., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., and Yi, L. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21013–21022, June 2022

[26]

NVIDIA, Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L. J., Fang, Y., Fox, D., Hu, F., Huang, S., Jang, J., Jiang, Z., Kautz, J., Kundalia, K., Lao, L., Li, Z., Lin, Z., Lin, K., Liu, G., Llontop, E., Magne, L., Mandlekar, A., Narayan, A., Nasiriany, S., Reed, S., Tan, Y. L., Wang, G., Wang, Z., Wang, J., Wang, Q., Xiang, J., Xie...

[27]

OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024. Accessed: 2025-09-18

[28]

Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023

[29]

Rigter, M., Gupta, T., Hilmkil, A., and Ma, C. AVID: Adapting video diffusion models to world models. In Reinforcement Learning Conference, 2025. URL https://openreview.net/forum?id=C18kcGeqAW

[30]

Schmidt, D. and Jiang, M. Learning to act without actions. International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2312.10812

[31]

Sutton, R. S. Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224, 1990

[32]

Tian, S., Finn, C., and Wu, J. A control-centric benchmark for video prediction. arXiv preprint arXiv:2304.13723, 2023

[33]

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017

[34]

Wang, J., Zhang, Q., Chao, Y.-W., Wen, B., Guo, X., and Xiang, Y. Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction, 2024. URL https://arxiv.org/abs/2406.06843

[35]

Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., Bohus, D., Feniello, A., Tekin, B., Frujeri, F. V., Joshi, N., and Pollefeys, M. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20270–20281, October 2023

[36]

Wang, Y., Wan, S., Gan, L., Feng, S., and Zhan, D.-C. Ad3: Implicit action is the key for world models to distinguish the diverse visual distractors. International Conference on Machine Learning. doi: 10.48550/arXiv.2403.09976

[38]

Xu, M., Dai, W., Liu, C., Gao, X., Lin, W., Qi, G.-J., and Xiong, H. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020

[39]

Ye, S., Jang, J., Jeon, B., Joo, S. J., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.-W., Lin, B. Y., Liden, L., Lee, K., Gao, J., Zettlemoyer, L., Fox, D., and Seo, M. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=VYOe2eBQeh

[40]

Ye, W., Zhang, Y., Abbeel, P., and Gao, Y. Become a proficient player with limited data through watching pure videos. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Sy-o2N0hF4f

[41]

Zhang, L., Kan, M., Shan, S., and Chen, X. Prelar: World model pre-training with learnable action representation. In European Conference on Computer Vision, pp. 185–201. Springer, 2024

[42]

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

[43]

Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., and Kong, T. Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540, 2024

A Dataset

We mainly focus on learning a latent action model and a world model for manipulation tasks that involve diverse downstream embodiments a...
