
arxiv: 2510.26433 · v2 · submitted 2025-10-30 · 💻 cs.LG

Co-Evolving Latent Action World Models

Pith reviewed 2026-05-18 02:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords latent action model, world model, joint training, warm-up phase, representation alignment, video simulation, visual planning, co-evolution

The pith

A warm-up phase aligns latent action models with pretrained world models to enable stable joint training and co-evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt pretrained video models into controllable world models by training the latent action model and world model together instead of in two separate stages. The joint approach risks representational collapse, but a dedicated warm-up phase first aligns the randomly initialized action model with the fixed world model. This alignment starts a cycle where the world model supplies gradients to refine the action model while the action model supplies more precise controls back to the world model. The result is video simulation and visual planning performance that matches or exceeds prior separate-training methods. If the approach holds, training pipelines for generalist world models could become both simpler and more effective.
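The two phases described above can be sketched as a simple update schedule. This is a minimal illustration under stated assumptions, not the paper's training code: the names `TrainableFlags`, `trainable_at`, and `warmup_steps` are hypothetical, and the sketch only encodes the summary's description of a fixed world model during warm-up followed by joint updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainableFlags:
    """Which components receive gradient updates at a given step (hypothetical)."""
    latent_action_model: bool
    world_model: bool


def trainable_at(step: int, warmup_steps: int) -> TrainableFlags:
    """Return update flags for a two-phase schedule.

    Warm-up (step < warmup_steps): only the latent action model is updated
    while the pretrained world model stays fixed.
    Afterwards: both components are updated jointly.
    """
    if step < 0 or warmup_steps < 0:
        raise ValueError("step and warmup_steps must be non-negative")
    in_warmup = step < warmup_steps
    return TrainableFlags(latent_action_model=True, world_model=not in_warmup)


if __name__ == "__main__":
    # warmup_steps=1000 is an arbitrary illustrative value.
    for step in (0, 999, 1000, 5000):
        print(step, trainable_at(step, warmup_steps=1000))
```

In a real training loop these flags would typically decide which parameter groups are handed to the optimizer. The length of the warm-up is a hyperparameter that this summary does not specify.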

Core claim

CoLA-World realizes the synergistic paradigm of jointly training a latent action model and a world model by using a critical warm-up phase that aligns their representations, unlocking a co-evolution cycle in which the world model shapes a high-quality latent action model while the latent action model provides a more precise control interface.

What carries the argument

The warm-up phase that aligns representations of the from-scratch latent action model with the pretrained world model to prevent representational collapse and enable stable beneficial co-adaptation.

Load-bearing premise

A dedicated warm-up phase can reliably align representations between a randomly initialized latent action model and a pretrained world model sufficiently to prevent collapse and enable stable joint training.

What would settle it

Joint training without the warm-up phase produces representational collapse and performs no better than separate two-stage training.
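One way to make "representational collapse" measurable for a discrete latent action codebook is to track how many codes are actually used. The sketch below computes codebook utilization and perplexity from a batch of code indices. These two metrics are common choices for vector-quantized models and are an assumption here: the figure captions on this page mention codebook metrics but do not say which ones the paper reports. The function name `codebook_stats` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def codebook_stats(indices: Sequence[int], codebook_size: int) -> Dict[str, float]:
    """Compute simple collapse indicators from selected codebook indices.

    utilization: fraction of codebook entries selected at least once.
    perplexity:  exp(entropy) of the empirical code distribution; it is 1.0
                 when a single code is always chosen (full collapse) and
                 equals codebook_size when usage is uniform.
    """
    if codebook_size <= 0:
        raise ValueError("codebook_size must be positive")
    if not indices:
        raise ValueError("indices must be non-empty")
    if any(i < 0 or i >= codebook_size for i in indices):
        raise ValueError("index outside the codebook range")

    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return {
        "utilization": len(counts) / codebook_size,
        "perplexity": math.exp(entropy),
    }


if __name__ == "__main__":
    print(codebook_stats([3] * 100, codebook_size=8))            # collapsed usage
    print(codebook_stats(list(range(8)) * 10, codebook_size=8))  # uniform usage
```

Logging these values during training with and without the warm-up would give a direct comparison of the two regimes.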

Figures

Figures reproduced from arXiv: 2510.26433 by De-Chuan Zhan, Fengming Zhang, Jiang Bian, Kaixin Wang, Li Zhao, Yucen Wang.

Figure 1: (a) Prior works use a two-stage pipeline: learn a latent action model (LAM), then fix it to train the world model. (b) We propose a one-stage pipeline, directly using the world model as the forward dynamics model and backpropagating gradients through latent actions.
Figure 2: Latent action codebook metrics during joint training of the IDM and world model. “rand” …
Figure 3: Latent action codebook metrics during warm-up and joint training. Different blue curves …
Figure 4: Evidence of synergistic co-evolution. The LAM’s probing loss drops faster when the world …
Figure 5: Codebook metrics in different training and adaptation stages. All subplots share the same …
Figure 6: Action transfer results. The source and target videos come from different datasets.
Original abstract

Adapting pretrained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamics model in the LAM with a powerful world model and train them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pretrained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
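The joint setup the abstract describes can be written schematically as follows. This is an illustrative sketch, not the paper's notation. The symbols are assumptions: $f_\phi$ is the inverse dynamics encoder of the LAM, $\mathrm{VQ}$ is a codebook quantizer (gradients are assumed to pass through it, for example via a straight-through estimator), $g_\theta$ is the pretrained world model used as the forward dynamics model, $\ell$ is a generic prediction loss, and $\eta$ is a learning rate. Codebook and commitment terms, and the paper's actual generative loss, are omitted.

```latex
\[
\begin{aligned}
z_t &= \mathrm{VQ}\big(f_\phi(o_t, o_{t+1})\big)
  && \text{(latent action inferred from consecutive frames)} \\
\hat{o}_{t+1} &= g_\theta(o_t, z_t)
  && \text{(world model as forward dynamics model)} \\
\mathcal{L}(\phi, \theta) &= \mathbb{E}\big[\ell(\hat{o}_{t+1}, o_{t+1})\big] \\
\text{warm-up:}\quad \phi &\leftarrow \phi - \eta\,\nabla_{\phi}\,\mathcal{L}(\phi, \theta),
  \qquad \theta \text{ fixed} \\
\text{joint:}\quad (\phi, \theta) &\leftarrow (\phi, \theta) - \eta\,\nabla_{(\phi, \theta)}\,\mathcal{L}(\phi, \theta)
\end{aligned}
\]
```

Written this way, the difference from a two-stage pipeline is that $\theta$ and $\phi$ are updated from the same loss in the same step once the warm-up ends, rather than $\phi$ being frozen before $\theta$ is trained.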

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoLA-World, a framework for jointly training a from-scratch latent action model (LAM) and a pretrained world model. A dedicated warm-up phase first aligns the LAM's representations with those of the world model, which avoids collapse and enables a co-evolution cycle in which the world model tutors the LAM and the LAM provides more precise control. The paper positions this as an advance over the dominant two-stage approach of separate training and reports video simulation quality and downstream visual planning performance that match or exceed it.

Significance. If the warm-up mechanism demonstrably enables stable co-adaptation without collapse, the work would meaningfully advance the field by replacing redundant two-stage pipelines with a synergistic joint-training paradigm for controllable world models, potentially improving efficiency and representation quality in video-based planning and simulation tasks.

major comments (3)
  1. [Methods (warm-up phase description)] Methods section on the warm-up procedure: the central claim that this phase 'effectively aligns the representations' and unlocks co-evolution lacks direct supporting measurements (e.g., cosine similarity, mutual information, or latent variance between LAM and world-model latents) tracked before versus after warm-up, leaving the mechanism unverified.
  2. [Section 4 (ablations and baselines)] Experiments and ablations: no ablation is reported that trains without the warm-up phase (or with random initialization throughout) and directly compares collapse indicators or final performance against the full CoLA-World pipeline, which is required to establish that the warm-up is load-bearing rather than incidental to hyperparameter choices or the pretrained checkpoint.
  3. [Results (quantitative tables)] Results tables (e.g., video quality and planning metrics): reported gains over two-stage baselines are presented without error bars across multiple random seeds or statistical significance tests, making it difficult to attribute improvements specifically to the co-evolution cycle versus other implementation details.
minor comments (2)
  1. [Section 3.3] Notation for the joint loss and gradient flow during co-evolution could be clarified with an explicit equation showing how LAM and world-model parameters are updated in the same step.
  2. [Figure 1] Figure illustrating the training pipeline would benefit from explicit arrows or labels distinguishing the warm-up phase from the subsequent joint training loop.
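The alignment measurements requested in major comment 1 could look like the following. This is a hedged sketch in plain Python: `cosine_similarity`, `mean_paired_cosine`, and `mean_dimension_variance` are hypothetical helper names, and the sketch assumes the LAM and world-model latents have already been mapped into a shared space of equal dimension, which the source does not describe.

```python
import math
from statistics import pvariance
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_paired_cosine(lam: Sequence[Vector], wm: Sequence[Vector]) -> float:
    """Average cosine similarity over paired LAM / world-model latents."""
    if len(lam) != len(wm) or not lam:
        raise ValueError("need two non-empty batches of equal size")
    return sum(cosine_similarity(a, b) for a, b in zip(lam, wm)) / len(lam)


def mean_dimension_variance(latents: Sequence[Vector]) -> float:
    """Population variance per dimension, averaged over dimensions.

    A value near zero means every sample maps to almost the same latent,
    which is one possible signature of collapse.
    """
    if not latents:
        raise ValueError("latents must be non-empty")
    dims = list(zip(*latents))  # transpose: one tuple of values per dimension
    return sum(pvariance(d) for d in dims) / len(dims)
```

Computing both quantities on the same held-out batch before and after warm-up would give the before-versus-after comparison the comment describes.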

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the empirical support needed for our claims about the warm-up phase and co-evolution. We address each major point below and will revise the manuscript to strengthen the validation of our method.

Point-by-point responses
  1. Referee: [Methods (warm-up phase description)] Methods section on the warm-up procedure: the central claim that this phase 'effectively aligns the representations' and unlocks co-evolution lacks direct supporting measurements (e.g., cosine similarity, mutual information, or latent variance between LAM and world-model latents) tracked before versus after warm-up, leaving the mechanism unverified.

    Authors: We agree that direct measurements would provide clearer verification of the alignment effect. In the revised manuscript, we will add analysis in the Methods section (and a new figure) reporting cosine similarity and latent variance between LAM and world-model latents before and after the warm-up phase. These metrics will quantify the alignment achieved and support the mechanism enabling stable co-evolution.
    Revision: yes

  2. Referee: [Section 4 (ablations and baselines)] Experiments and ablations: no ablation is reported that trains without the warm-up phase (or with random initialization throughout) and directly compares collapse indicators or final performance against the full CoLA-World pipeline, which is required to establish that the warm-up is load-bearing rather than incidental to hyperparameter choices or the pretrained checkpoint.

    Authors: We acknowledge that an explicit ablation isolating the warm-up is essential to demonstrate its necessity. We will add this ablation to Section 4, training without the warm-up (random initialization throughout) and comparing collapse indicators (e.g., latent divergence metrics) as well as final video quality and planning performance against the full pipeline. This will show that the warm-up is load-bearing for avoiding collapse and achieving the reported results.
    Revision: yes

  3. Referee: [Results (quantitative tables)] Results tables (e.g., video quality and planning metrics): reported gains over two-stage baselines are presented without error bars across multiple random seeds or statistical significance tests, making it difficult to attribute improvements specifically to the co-evolution cycle versus other implementation details.

    Authors: We agree that error bars and statistical tests would strengthen attribution of gains to the co-evolution cycle. In the revision, we will rerun the primary experiments across multiple random seeds and report mean and standard deviation in the quantitative tables. We will also include statistical significance tests (e.g., paired t-tests) comparing CoLA-World to the two-stage baselines to better isolate the contribution of joint training.
    Revision: yes
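The seed-level reporting described in the third response can be sketched with the standard library alone. The helper names `summarize` and `paired_t_statistic` are hypothetical, the scores in the usage example are made-up placeholders rather than results from the paper, and turning the t statistic into a p-value would need a Student's t distribution from a statistics package, which is left out here.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds")
    return mean(scores), stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched seeds.

    a[i] and b[i] must come from the same seed / evaluation setting.
    t = mean(d) / (stdev(d) / sqrt(n)), where d = a - b.
    """
    if len(a) != len(b):
        raise ValueError("both methods need the same number of seeds")
    if len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not taken from the paper.
    joint = [1.0, 2.0, 3.0]
    two_stage = [0.0, 0.0, 0.0]
    print(summarize(joint))
    print(paired_t_statistic(joint, two_stage))
```

A paired test is the natural choice only if both methods are evaluated on the same seeds and data splits; otherwise an unpaired test would be more appropriate.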

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes an empirical training procedure consisting of a warm-up phase to align a randomly initialized latent action model with a pretrained world model, followed by joint co-evolution training. No equations, derivations, or parameter-fitting steps are presented that reduce any claimed output (such as improved video quality or planning performance) to a quantity defined by the method itself. The warm-up phase is introduced as an independent procedural intervention rather than a self-referential fit or renamed input. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core mechanism. The central claims rest on reported empirical matches or improvements over two-stage baselines, rendering the argument self-contained against external benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard deep learning assumptions about gradient-based optimization and representation learning in video models, with no explicit free parameters or invented entities detailed in the abstract.

axioms (1)
  • Domain assumption: Pretrained video generation models contain useful dynamics representations that can be adapted for controllable world modeling via latent actions.
    This is the foundational premise for adapting video models into world models.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  2. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

    cs.RO 2026-02 unverdicted novelty 7.0

    DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

  3. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  4. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 3 Pith papers · 11 internal anchors

  1. [1]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors, Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Huang, X., Jiang, S., Jiang, Y., Jing, C., Li, H., Li, J., Liu, C., Liu, Y., Lu, Y., Luo, J., Luo, P., Mu, Y., Niu, Y., Pan, Y., Pang, J., Qiao, Y., Ren, G., Ruan, C., Shan, J., Shen, Y., Shi, C., Shi, M., Shi, M., Sima, C., Song, J., Wang, H., Wang, W., W...

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Genie: Generative interactive environments

    Bruce, J., Dennis, M. D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al. Genie: Generative interactive environments. In International Conference on Machine Learning, pp. 4603–4623. PMLR, 2024

  4. [4]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., and Li, H. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

  5. [5]

    Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11315–11325, June 2022

  6. [6]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Chen, X., Wei, H., Zhang, P., Zhang, C., Wang, K., Guo, Y., Yang, R., Wang, Y., Xiao, X., Zhao, L., Chen, J., and Bian, J. villa-x: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025

  7. [7]

    Collaboration, O. X.-E., O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., Tung, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Gupta, A., Wang, A., Kolobov, A., Singh, A., Garg, A., Kembhavi, A., Xie, A., Brohan, A., Raffin, A., Sharma, A., Yavary, A., Jain, A., Balak...

  8. [8]

    A universal world model learned from large scale and diverse videos

    Cui, H. and Gao, Y. A universal world model learned from large scale and diverse videos. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023

  9. [9]

    The epic-kitchens dataset: Collection, challenges and baselines

    Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

  10. [10]

    Rh20t: A robotic dataset for learning diverse skills in one-shot

    Fang, H.-S., Fang, H., Tang, Z., Liu, J., Wang, J., Zhu, H., and Lu, C. Rh20t: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

  11. [11]

    Adaworld: Learning adaptable world models with latent actions

    Gao, S., Zhou, S., Du, Y., Zhang, J., and Gan, C. Adaworld: Learning adaptable world models with latent actions. In International Conference on Machine Learning (ICML), 2025

  12. [12]

    The "something something" video database for learning and evaluating visual common sense

    Goyal, R., Kahou, S. E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. The "something something" video database for learning and evaluating visual common sense, 2017. URL https://arxiv.org/abs/1706.04261

  13. [13]

    K., Ryan, F., Sharma, J., Wray, M., Xu, M., Xu, E

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., Martin, M., Nagarajan, T., Radosavovic, I., Ramakrishnan, S. K., Ryan, F., Sharma, J., Wray, M., Xu, M., Xu, E. Z., Zhao, C., Bansal, S., Batra, D., Cartillier, V., Crane, S., Do, T., Doulaty, M., Erapalli, A., Feichtenhofer, C., Fragom...

  14. [14]

    World Models

    Ha, D. and Schmidhuber, J. World models.arXiv preprint arXiv:1803.10122, 2018

  15. [15]

    Mastering atari with discrete world models

    Hafner, D., Lillicrap, T. P., Norouzi, M., and Ba, J. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0oabwyZbOu

  16. [16]

    Pre-trained video generative models as world simulators

    He, H., Zhang, Y., Lin, L., Xu, Z., and Pan, L. Pre-trained video generative models as world simulators. arXiv preprint arXiv:2502.07825, 2025

  17. [17]

    Enerverse: Envisioning embodied future space for robotics manipulation

    Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al. Enerverse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895, 2025

  18. [18]

    Vid2world: Crafting video diffusion models to interactive world models

    Huang, S., Wu, J., Zhou, Q., Miao, S., and Long, M. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025

  19. [19]

    Enerverse-ac: Envisioning embodied environments with action condition

    Jiang, Y., Chen, S., Huang, S., Chen, L., Zhou, P., Liao, Y., He, X., Liu, C., Li, H., Yao, M., et al. Enerverse-ac: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025

  20. [20]

    Robodesk: A multi-task reinforcement learning benchmark

    Kannan, H., Hafner, D., Finn, C., and Erhan, D. Robodesk: A multi-task reinforcement learning benchmark. https://github.com/google-research/robodesk, 2021

  21. [21]

    Li, Y., Liu, M., and Rehg, J. M. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European conference on computer vision (ECCV), pp. 619–635, 2018.

  22. [22]

    Egocentric prediction of action target in 3d

    Li, Y., Cao, Z., Liang, A., Liang, B., Chen, L., Zhao, H., and Feng, C. Egocentric prediction of action target in 3d. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022

  23. [23]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023

  24. [24]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  25. [25]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction

    Liu, Y., Liu, Y., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., and Yi, L. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21013–21022, June 2022

  26. [26]

    NVIDIA, Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L. J., Fang, Y., Fox, D., Hu, F., Huang, S., Jang, J., Jiang, Z., Kautz, J., Kundalia, K., Lao, L., Li, Z., Lin, Z., Lin, K., Liu, G., Llontop, E., Magne, L., Mandlekar, A., Narayan, A., Nasiriany, S., Reed, S., Tan, Y. L., Wang, G., Wang, Z., Wang, J., Wang, Q., Xiang, J., Xie...

  27. [27]

    Sora: Creating video from text

    OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024. Accessed: 2025-09-18

  28. [28]

    Scalable diffusion models with transformers

    Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023

  29. [29]

    AVID: Adapting video diffusion models to world models

    Rigter, M., Gupta, T., Hilmkil, A., and Ma, C. AVID: Adapting video diffusion models to world models. In Reinforcement Learning Conference, 2025. URL https://openreview.net/forum?id=C18kcGeqAW

  30. [30]

    Learning to act without actions

    Schmidt, D. and Jiang, M. Learning to act without actions. International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2312.10812

  31. [31]

    Sutton, R. S. Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference (1990) on Machine learning, pp. 216–224, 1990

  32. [32]

    A control-centric benchmark for video prediction

    Tian, S., Finn, C., and Wu, J. A control-centric benchmark for video prediction. arXiv preprint arXiv:2304.13723, 2023

  33. [33]

    Neural discrete representation learning

    Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  34. [34]

    Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction

    Wang, J., Zhang, Q., Chao, Y.-W., Wen, B., Guo, X., and Xiang, Y. Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction, 2024. URL https://arxiv.org/abs/2406.06843

  35. [35]

    Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

    Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., Bohus, D., Feniello, A., Tekin, B., Frujeri, F. V., Joshi, N., and Pollefeys, M. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20270–20281, October 2023

  36. [36]

    Ad3: Implicit action is the key for world models to distinguish the diverse visual distractors

    Wang, Y., Wan, S., Gan, L., Feng, S., and Zhan, D.-C. Ad3: Implicit action is the key for world models to distinguish the diverse visual distractors. International Conference on Machine Learning,

  37. [37]

    doi: 10.48550/arXiv.2403.09976

  38. [38]

    Spatial-temporal transformer networks for traffic flow forecasting

    Xu, M., Dai, W., Liu, C., Gao, X., Lin, W., Qi, G.-J., and Xiong, H. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020

  39. [39]

    Latent action pretraining from videos

    Ye, S., Jang, J., Jeon, B., Joo, S. J., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.-W., Lin, B. Y., Liden, L., Lee, K., Gao, J., Zettlemoyer, L., Fox, D., and Seo, M. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=VYOe2eBQeh.

  40. [40]

    Become a proficient player with limited data through watching pure videos

    Ye, W., Zhang, Y., Abbeel, P., and Gao, Y. Become a proficient player with limited data through watching pure videos. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Sy-o2N0hF4f

  41. [41]

    Prelar: World model pre-training with learnable action representation

    Zhang, L., Kan, M., Shan, S., and Chen, X. Prelar: World model pre-training with learnable action representation. In European Conference on Computer Vision, pp. 185–201. Springer, 2024

  42. [42]

    Open-Sora: Democratizing Efficient Video Production for All

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  43. [43]

    Irasim: A fine-grained world model for robot manipulation, 2025

    Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., and Kong, T. Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540, 2024.