Co-Evolving Latent Action World Models
Pith reviewed 2026-05-18 02:48 UTC · model grok-4.3
The pith
A warm-up phase aligns latent action models with pretrained world models to enable stable joint training and co-evolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoLA-World realizes the synergistic paradigm of jointly training a latent action model and a world model by using a critical warm-up phase that aligns their representations, unlocking a co-evolution cycle in which the world model shapes a high-quality latent action model while the latent action model provides a more precise control interface.
What carries the argument
The warm-up phase that aligns representations of the from-scratch latent action model with the pretrained world model to prevent representational collapse and enable stable beneficial co-adaptation.
Load-bearing premise
A dedicated warm-up phase can reliably align representations between a randomly initialized latent action model and a pretrained world model sufficiently to prevent collapse and enable stable joint training.
What would settle it
Joint training without the warm-up phase produces representational collapse and performance that is no better than, or worse than, separate two-stage training.
Abstract
Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pretrained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoLA-World, a framework for jointly training a from-scratch latent action model (LAM) and a pretrained world model. A dedicated warm-up phase aligns the LAM's representations with those of the world model, which is claimed to avoid collapse and to enable a co-evolution cycle in which the world model tutors the LAM and the LAM provides a more precise control interface. The approach is positioned as an advance over the dominant two-stage paradigm of separate training, with empirical results reported to match or exceed it in video simulation quality and downstream visual planning.
Significance. If the warm-up mechanism demonstrably enables stable co-adaptation without collapse, the work would meaningfully advance the field by replacing redundant two-stage pipelines with a synergistic joint-training paradigm for controllable world models, potentially improving efficiency and representation quality in video-based planning and simulation tasks.
major comments (3)
- [Methods (warm-up phase description)] Methods section on the warm-up procedure: the central claim that this phase 'effectively aligns the representations' and unlocks co-evolution lacks direct supporting measurements (e.g., cosine similarity, mutual information, or latent variance between LAM and world-model latents) tracked before versus after warm-up, leaving the mechanism unverified.
- [Section 4 (ablations and baselines)] Experiments and ablations: no ablation is reported that trains without the warm-up phase (or with random initialization throughout) and directly compares collapse indicators or final performance against the full CoLA-World pipeline, which is required to establish that the warm-up is load-bearing rather than incidental to hyperparameter choices or the pretrained checkpoint.
- [Results (quantitative tables)] Results tables (e.g., video quality and planning metrics): reported gains over two-stage baselines are presented without error bars across multiple random seeds or statistical significance tests, making it difficult to attribute improvements specifically to the co-evolution cycle versus other implementation details.
minor comments (2)
- [Section 3.3] Notation for the joint loss and gradient flow during co-evolution could be clarified with an explicit equation showing how LAM and world-model parameters are updated in the same step.
- [Figure 2] Figure illustrating the training pipeline would benefit from explicit arrows or labels distinguishing the warm-up phase from the subsequent joint training loop.
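The alignment and collapse indicators named in the major comments (cosine similarity and latent variance) can be computed with very little machinery. The following is a minimal sketch, not the paper's evaluation code. It assumes the LAM and world-model latents have already been projected into a shared space of equal dimension, and the function names and the variance threshold are hypothetical choices made for illustration.

```python
import math
from statistics import pvariance


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(latents_a, latents_b):
    """Average cosine similarity over paired latent vectors."""
    sims = [cosine_similarity(u, v) for u, v in zip(latents_a, latents_b)]
    return sum(sims) / len(sims)


def per_dimension_variance(latents):
    """Population variance of each latent dimension across a batch."""
    return [pvariance(column) for column in zip(*latents)]


def looks_collapsed(latents, threshold=1e-4):
    """True if every latent dimension has near-zero variance.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return all(v < threshold for v in per_dimension_variance(latents))
```

Logging mean_pairwise_cosine and per_dimension_variance at the start and end of the warm-up, and again during joint training, would give the before-versus-after comparison the first major comment asks for.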
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the empirical support needed for our claims about the warm-up phase and co-evolution. We address each major point below and will revise the manuscript to strengthen the validation of our method.
-
Referee: [Methods (warm-up phase description)] Methods section on the warm-up procedure: the central claim that this phase 'effectively aligns the representations' and unlocks co-evolution lacks direct supporting measurements (e.g., cosine similarity, mutual information, or latent variance between LAM and world-model latents) tracked before versus after warm-up, leaving the mechanism unverified.
Authors: We agree that direct measurements would provide clearer verification of the alignment effect. In the revised manuscript, we will add an analysis to the Methods section (with a new figure) reporting cosine similarity and latent variance between LAM and world-model latents before and after the warm-up phase. These metrics will quantify the alignment achieved and support the mechanism enabling stable co-evolution. revision: yes
-
Referee: [Section 4 (ablations and baselines)] Experiments and ablations: no ablation is reported that trains without the warm-up phase (or with random initialization throughout) and directly compares collapse indicators or final performance against the full CoLA-World pipeline, which is required to establish that the warm-up is load-bearing rather than incidental to hyperparameter choices or the pretrained checkpoint.
Authors: We acknowledge that an explicit ablation isolating the warm-up is essential to demonstrate its necessity. We will add this ablation to Section 4, training without the warm-up (random initialization throughout) and comparing collapse indicators (e.g., latent divergence metrics) as well as final video quality and planning performance against the full pipeline. This will show that the warm-up is load-bearing for avoiding collapse and achieving the reported results. revision: yes
-
Referee: [Results (quantitative tables)] Results tables (e.g., video quality and planning metrics): reported gains over two-stage baselines are presented without error bars across multiple random seeds or statistical significance tests, making it difficult to attribute improvements specifically to the co-evolution cycle versus other implementation details.
Authors: We agree that error bars and statistical tests would strengthen attribution of gains to the co-evolution cycle. In the revision, we will rerun the primary experiments across multiple random seeds and report mean and standard deviation in the quantitative tables. We will also include statistical significance tests (e.g., paired t-tests) comparing CoLA-World to the two-stage baselines to better isolate the contribution of joint training. revision: yes
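The multi-seed reporting promised in the last response can be sketched as follows. This is an illustrative helper, not the authors' analysis: the function names are hypothetical, and it assumes one score per seed for each method, with seeds matched between the two methods. It returns only the paired t statistic and its degrees of freedom. Converting these to a p-value requires the Student t distribution, which the Python standard library does not provide (SciPy's scipy.stats.ttest_rel is one option).

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return mean(scores), stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for per-seed scores of two methods.

    scores_a[i] and scores_b[i] must come from the same seed.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

With only a handful of seeds the test has little power, so the per-seed differences themselves are worth reporting alongside the statistic.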
Circularity Check
No significant circularity detected
The paper describes an empirical training procedure consisting of a warm-up phase to align a randomly initialized latent action model with a pretrained world model, followed by joint co-evolution training. No equations, derivations, or parameter-fitting steps are presented that reduce any claimed output (such as improved video quality or planning performance) to a quantity defined by the method itself. The warm-up phase is introduced as an independent procedural intervention rather than a self-referential fit or renamed input. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core mechanism. The central claims rest on reported empirical matches or improvements over two-stage baselines, rendering the argument self-contained against external benchmarks without circular reduction.
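The two-phase schedule described above can be illustrated with a deliberately tiny toy. This is a sketch of the update schedule only, not the paper's architecture, and it does not reproduce representational collapse. It assumes a scalar inverse-dynamics parameter a (standing in for the LAM) and a scalar world-model parameter w, both hypothetical. During warm-up the world model is frozen but the gradient still flows through it to update a; afterwards both parameters are updated in the same step.

```python
def gradients(a, w, transitions):
    """Mean gradients of the squared prediction error w.r.t. a and w."""
    grad_a = grad_w = 0.0
    for x, x_next in transitions:
        delta = x_next - x
        z = a * delta                 # latent action inferred by the toy IDM
        error = (x + w * z) - x_next  # toy world-model prediction error
        grad_a += 2.0 * error * w * delta  # chain rule through the world model
        grad_w += 2.0 * error * z
    n = len(transitions)
    return grad_a / n, grad_w / n


def train(transitions, a=0.0, w=2.0, warmup_steps=200, joint_steps=200, lr=0.05):
    """Warm-up (w frozen) followed by joint training (a and w both updated)."""
    for step in range(warmup_steps + joint_steps):
        grad_a, grad_w = gradients(a, w, transitions)
        a -= lr * grad_a              # the IDM is updated in both phases
        if step >= warmup_steps:
            w -= lr * grad_w          # the world model is unfrozen only after warm-up
    return a, w
```

In this toy the loss is minimized whenever a * w equals 1, so the warm-up drives a toward 1 / w while w keeps its initial value.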
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained video generation models contain useful dynamics representations that can be adapted for controllable world modeling via latent actions.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
unclear: Relation between the paper passage and the cited Recognition theorem.
naively training the IDM and world model together can easily lead to collapse... warm-up phase in which the world model is kept frozen and only supplies gradients to update the IDM
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
Latent State Design for World Models under Sufficiency Constraints
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
-
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.