Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Dongzhou Cheng; Jiaqi Wang; Juan Wang; Lingxuan Weng; Shiyue Wang; Xiaoyang Xu; Yibin Wang; Ziang Li

arxiv: 2606.08242 · v1 · pith:WBU5L4WAnew · submitted 2026-06-06 · 💻 cs.CV

Pith reviewed 2026-06-27 19:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords robot manipulation · world action models · efficient policies · latent video supervision · state fusion · action chunk prediction · multi-task learning

The pith

A lightweight world action model achieves competitive robot manipulation performance with 0.44B parameters by supervising future video only in downsampled latent space and using state-fusion action decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

World Action Models add future prediction to robot policy training so that representations capture task-relevant temporal structure. Existing versions rely on large generative networks that raise training cost and slow inference, limiting closed-loop deployment. Light-WAM replaces those heavy components with a compact video backbone whose future-video loss is computed only after downsampling into latent space. A new StateFusionActionExpert then pools adapted states from several backbone layers through learned queries and outputs action chunks in one pass. The resulting system keeps the representation benefit of video co-training while cutting parameter count, latency, and memory enough for practical multi-task use on standard benchmarks.
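The training setup described above can be summarized, roughly, as an action loss plus a weighted video loss computed on downsampled latents. The notation below is assumed here for illustration and is not taken from the paper: $\lambda$ is a loss weight, $D_s$ is a downsampling operator with factor $s$, $H$ is the action-chunk horizon, and the exact form of each loss term may differ in the actual method.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{act}}\big(\hat{a}_{t:t+H},\, a_{t:t+H}\big)
  + \lambda\, \mathcal{L}_{\text{vid}}\big(\hat{z}_{\text{vid}},\, D_s(z_{\text{vid}})\big)
```

Under this reading, only the first term's branch needs to run at inference time, which is consistent with the video branch being described as a training-time co-objective.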

Core claim

Light-WAM demonstrates that future-video supervision performed in a downsampled latent space, paired with a StateFusionActionExpert that fuses multi-layer states via learned-query pooling, allows a 0.44B-parameter model to retain strong performance on LIBERO and deliver usable multi-task results on RoboTwin 2.0, while reaching 72.03 ms inference latency and 4.1 GiB peak GPU memory.

What carries the argument

The StateFusionActionExpert, which reads adapted states from multiple layers of a compact video backbone, fuses them through learned-query pooling, and directly outputs action chunks in a single forward pass.
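A minimal sketch of learned-query pooling over multi-layer states may make this mechanism concrete. It is written in plain Python under simplifying assumptions: the function names, toy sizes, and the single linear head are hypothetical, key/value projections and the backbone adapters are omitted, and the random vectors stand in for what would be trained parameters. It is not the paper's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def learned_query_pool(queries, tokens):
    """Each query attends over all tokens and returns one weighted average."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) * scale for t in tokens])
        pooled.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return pooled


def decode_action_chunk(layer_states, queries, head, horizon, action_dim):
    """Fuse token states from several layers, then map them to an action chunk."""
    tokens = [tok for layer in layer_states for tok in layer]  # stack layers along the token axis
    pooled = learned_query_pool(queries, tokens)
    flat = [v for vec in pooled for v in vec]
    out = [dot(row, flat) for row in head]  # one linear head, no iterative sampling
    return [out[i * action_dim:(i + 1) * action_dim] for i in range(horizon)]


# Toy sizes, chosen only for illustration (hypothetical, not from the paper).
rng = random.Random(0)
dim, n_layers, n_tokens, n_queries, horizon, action_dim = 8, 3, 5, 4, 6, 7


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


layer_states = [[rand_vec(dim) for _ in range(n_tokens)] for _ in range(n_layers)]
queries = [rand_vec(dim) for _ in range(n_queries)]  # would be trained parameters
head = [rand_vec(n_queries * dim) for _ in range(horizon * action_dim)]
chunk = decode_action_chunk(layer_states, queries, head, horizon, action_dim)
```

The number of pooled vectors is fixed by the number of queries, not by how many layers or tokens are read, so the head's input size stays constant as more backbone layers are fused.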

If this is right

Maintains strong performance on the LIBERO benchmark.
Achieves usable multi-task performance on RoboTwin 2.0.
Requires only 0.44B trainable parameters.
Reaches 72.03 ms inference latency with 4.1 GiB peak GPU memory.
Improves training throughput relative to prior heavy generative WAMs.
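Latency figures like the one listed above depend on how they are measured. The harness below is a hedged sketch of one common protocol (warm-up runs, then repeated timed calls, then a median); the function name and defaults are hypothetical, and the paper's own measurement protocol is not described in this summary.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, repeats=50):
    """Time a zero-argument callable and report summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the statistics
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; a real policy forward pass would go here.
stats = measure_latency_ms(lambda: sum(range(10_000)), warmup=2, repeats=20)
```

When the callable launches GPU work, the device has to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch), otherwise the timer only captures kernel launch overhead.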

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of video representation learning from direct action prediction could let the same backbone serve multiple robot embodiments without retraining the full model.
Downsampling the supervision signal may permit longer prediction horizons or higher frame rates without proportional growth in compute.
If the latent-space objective generalizes, similar efficiency gains might appear in other vision-based control domains that currently rely on full generative video models.
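To illustrate why downsampling the supervision target reduces cost, the sketch below average-pools latent frames and counts the resulting tokens. Everything here is an assumption made for illustration: average pooling as the downsampling operator, mean squared error as a stand-in loss, and the 8 × 32 × 32 latent shape. The paper may use a different operator and objective.

```python
def avg_pool_2d(frame, factor):
    """Average-pool a 2D grid (list of rows) over non-overlapping factor x factor blocks."""
    height, width = len(frame), len(frame[0])
    assert height % factor == 0 and width % factor == 0
    pooled = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block = [frame[i + di][j + dj] for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def latent_video_loss(pred_frames, target_frames, factor):
    """MSE between low-resolution predictions and average-pooled target latents."""
    total, count = 0.0, 0
    for pred, target in zip(pred_frames, target_frames):
        pooled = avg_pool_2d(target, factor)
        for pred_row, target_row in zip(pred, pooled):
            for p, t in zip(pred_row, target_row):
                total += (p - t) ** 2
                count += 1
    return total / count


def token_count(frames, height, width, factor=1):
    """Number of spatial-temporal tokens after spatial downsampling by `factor`."""
    return frames * (height // factor) * (width // factor)


full = token_count(8, 32, 32)              # hypothetical latent video shape
reduced = token_count(8, 32, 32, factor=2)
```

With these hypothetical numbers, a spatial factor of 2 cuts the supervised tokens from 8192 to 2048, a 4× reduction, and any pairwise attention over those tokens shrinks by roughly 16×.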

Load-bearing premise

That performing future-video supervision only in a downsampled latent space retains the representation-learning benefits of full video co-training without introducing new failure modes on manipulation tasks.

What would settle it

A controlled experiment in which the identical backbone trained with full-resolution pixel-space video prediction produces markedly higher success rates on LIBERO or RoboTwin 2.0 than the downsampled-latent version would falsify the claim that the efficiency trade-off preserves task performance.
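One way to judge whether a gap between the two variants in such an experiment is "markedly higher" is a two-proportion z-test on episode success counts. The sketch below uses only the standard library; the counts in the example are invented for illustration and are not results from the paper.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0.0:
        return 0.0, 1.0  # both variants always succeed or always fail
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: pixel-space variant vs. downsampled-latent variant.
z, p_value = two_proportion_z(90, 100, 70, 100)
```

A small p-value here would support the falsifying outcome; comparable rates with overlapping uncertainty would be consistent with the efficiency trade-off preserving task performance.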

Figures

Figures reproduced from arXiv: 2606.08242 by Dongzhou Cheng, Jiaqi Wang, Juan Wang, Lingxuan Weng, Shiyue Wang, Xiaoyang Xu, Yibin Wang, Ziang Li.

**Figure 1.** Overview of Light-WAM. Light-WAM shares an adapted video backbone between video co-training and action prediction. During training, the video branch applies future-video supervision to downsampled latent videos z̄_vid, reducing the token cost of temporal supervision. The action prediction branch runs in both training and inference: it takes the current observation latent z_act and predicts action chunks with…

**Figure 2.** RoboTwin 2.0 inference efficiency-performance comparison.

**Figure 3.** Qualitative analysis. Top: future-video predictions compared with reference rollout frames at t = {+8, +16, +24, +32}. Bottom: learned-query visualizations from the StateFusionActionExpert.

**Figure 4.** Real-world evaluation. Robot setup and success rates on three dual-arm tasks.

**Figure 5.** Additional real-world rollouts. Rollout frames on three dual-arm tasks and future-video predictions compared with ground-truth future frames.

Original abstract

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03 ms inference latency with 4.1 GiB peak GPU memory and improved training throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Light-WAM adds a StateFusionActionExpert for direct chunk prediction and cuts costs via downsampled latent supervision, but the performance numbers rest on thin evidence.

The main takeaway is that Light-WAM pairs a compact video backbone with downsampled latent-space video supervision and a new StateFusionActionExpert that fuses multi-layer states through learned-query pooling to output action chunks in one pass.

The StateFusionActionExpert is the clearest new piece. It gives a direct, non-generative interface from the backbone to robot actions, which is not described in the earlier WAM papers cited. That design choice supports the efficiency goal without adding heavy action heads.

The paper shows concrete efficiency numbers: 0.44B trainable parameters, 72 ms inference, and 4.1 GiB peak memory, while reporting strong LIBERO results and usable multi-task performance on RoboTwin 2.0. The downsampled supervision and single-pass prediction are straightforward ways to lower training and deployment cost.
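A short worked calculation shows what the reported latency implies for closed-loop use. It assumes synchronous inference (the robot waits for each forward pass) and a 30 Hz control rate; both are assumptions made here for illustration, and only the 72.03 ms figure comes from the paper.

```python
import math


def max_replan_rate_hz(latency_ms):
    """Upper bound on how often a new chunk can be predicted per second."""
    return 1000.0 / latency_ms


def min_chunk_steps(latency_ms, control_hz):
    """Fewest actions per chunk that cover one inference call at the given control rate."""
    return math.ceil(latency_ms / 1000.0 * control_hz)


replan_hz = max_replan_rate_hz(72.03)            # reported latency
steps_needed = min_chunk_steps(72.03, control_hz=30)  # 30 Hz is a hypothetical control rate
```

Under these assumptions the policy can replan at just under 14 Hz, and each chunk needs at least 3 actions to keep a 30 Hz controller supplied while the next chunk is computed.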

The soft spot is the lack of detail around the central assumption. Downsampling the latent video space is meant to keep representation benefits while cutting cost, but if it removes fine spatial cues needed for precise manipulation, the StateFusionActionExpert cannot recover them. The abstract gives no ablations on the downsampling factor, no error bars, and no baseline comparisons, so it is difficult to judge whether the reported numbers hold up or whether new failure modes appear on contact-rich tasks.

This work is for people building deployable closed-loop policies who already follow WAM-style training and want lower compute. Readers focused on architecture tweaks for efficiency will find the expert design useful. The new component is concrete enough that the paper deserves a serious referee, though the experiments will need more scrutiny to support the claims.

Referee Report

3 major / 2 minor

Summary. The paper proposes Light-WAM, a lightweight World Action Model for robot manipulation. It uses a compact video backbone with future-video supervision performed exclusively in a downsampled latent space to reduce co-training costs while aiming to retain representation-learning benefits. Action prediction is handled by the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them via learned-query pooling, and predicts action chunks in a single forward pass. Experiments claim that this yields strong performance on LIBERO, usable multi-task results on RoboTwin 2.0, using only 0.44B trainable parameters, with 72.03 ms inference latency and 4.1 GiB peak GPU memory.

Significance. If the performance and efficiency claims hold under scrutiny, the work could meaningfully advance deployable closed-loop policies by lowering the computational overhead of world-action models, particularly through the single-pass StateFusionActionExpert design. The explicit identification of the downsampling factor and backbone-layer count as free parameters is a positive step toward transparency.

major comments (3)

[§3] §3 (downsampling for latent video space): the central efficiency claim rests on future-video supervision in the downsampled latent space preserving task-relevant temporal and spatial structure. No ablation or sensitivity analysis is reported on the downsampling factor (listed as a free parameter), leaving open whether fine-grained cues required for contact-rich manipulation are lost.
[Results] Results section (LIBERO and RoboTwin tables): performance is reported as 'strong' and 'usable' without error bars, number of seeds, or explicit baseline comparisons and data-exclusion details. This makes it impossible to assess whether the 0.44B-parameter model reliably matches or exceeds prior WAMs.
[StateFusionActionExpert] StateFusionActionExpert description: the learned-query pooling is presented as the efficient interface, yet no controlled comparison to simpler concatenation or attention baselines is given to establish that this component is necessary for the reported latency and memory figures.
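The statistical reporting requested in the second comment can be sketched with the standard library: a mean and sample standard deviation across seeds, plus a Wilson score interval for a single success rate. The per-seed rates and episode counts below are invented placeholders, not values from the paper.

```python
import math
import statistics


def summarize_seeds(success_rates):
    """Mean and sample standard deviation of per-seed success rates."""
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates) if len(success_rates) > 1 else 0.0
    return mean, std


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives about 95%)."""
    rate = successes / trials
    denom = 1 + z * z / trials
    center = (rate + z * z / (2 * trials)) / denom
    half = z * math.sqrt(rate * (1 - rate) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


mean, std = summarize_seeds([0.9, 0.8, 1.0])  # hypothetical per-seed success rates
low, high = wilson_interval(45, 50)           # hypothetical: 45 successes in 50 episodes
```

The Wilson interval is used here rather than the normal approximation because success rates near 100% are common on these benchmarks, where the simpler interval can extend past 1.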

minor comments (2)

[Abstract] Abstract: 'improved training throughput' is stated without a quantitative comparison or reference to a specific baseline.
[Abstract] Notation: the term 'StateFusionActionExpert' appears in the abstract before its definition, which may reduce readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [§3] §3 (downsampling for latent video space): the central efficiency claim rests on future-video supervision in the downsampled latent space preserving task-relevant temporal and spatial structure. No ablation or sensitivity analysis is reported on the downsampling factor (listed as a free parameter), leaving open whether fine-grained cues required for contact-rich manipulation are lost.

Authors: We agree that an explicit sensitivity analysis on the downsampling factor would better substantiate the claim that task-relevant structure is retained for contact-rich tasks. Although the factor is identified as a free parameter, we will add an ablation study in the revised manuscript varying this factor and reporting effects on LIBERO and RoboTwin performance.
revision: yes

Referee: [Results] Results section (LIBERO and RoboTwin tables): performance is reported as 'strong' and 'usable' without error bars, number of seeds, or explicit baseline comparisons and data-exclusion details. This makes it impossible to assess whether the 0.44B-parameter model reliably matches or exceeds prior WAMs.

Authors: We acknowledge that statistical reporting with error bars and seed counts is necessary for assessing reliability. The experiments used multiple seeds; we will revise the results section to include error bars, specify the number of seeds, add explicit baseline comparisons, and clarify data-exclusion details.
revision: yes

Referee: [StateFusionActionExpert] StateFusionActionExpert description: the learned-query pooling is presented as the efficient interface, yet no controlled comparison to simpler concatenation or attention baselines is given to establish that this component is necessary for the reported latency and memory figures.

Authors: The learned-query pooling in StateFusionActionExpert is intended to provide an efficient interface. We did not include direct comparisons in the original submission. We will add controlled ablations against concatenation and attention baselines in the revision to demonstrate its contribution to the reported latency and memory figures.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on empirical benchmarks

The paper describes an architectural proposal (compact video backbone + downsampled latent supervision + StateFusionActionExpert) whose central claims are validated through training and evaluation on LIBERO and RoboTwin 2.0. No equations, fitted parameters, or self-citations are presented as deriving the reported metrics (0.44B parameters, latency, success rates) by construction. Design choices are justified by efficiency arguments and experimental outcomes rather than definitional equivalence or load-bearing self-citation chains. This is the expected outcome for an empirical systems paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The performance claims rest on standard neural-network training assumptions plus the untested premise that latent-space video prediction preserves task-relevant temporal structure; no new physical entities are introduced.

free parameters (2)

downsampling factor for latent video space
Chosen to reduce supervision cost; value not stated in abstract but directly affects whether representation benefits are retained.
number of backbone layers read by StateFusionActionExpert
Architectural hyperparameter that controls the fusion mechanism and is fitted during training.
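The two free parameters above define a small ablation grid. The sketch below enumerates such a grid with a frozen dataclass; the class name, field names, candidate factors, and layer indices are all hypothetical placeholders, since the values used in the paper are not stated here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    downsample_factor: int  # spatial factor applied before the video loss (hypothetical values)
    fused_layers: tuple     # backbone layer indices read by the action expert (hypothetical)


def build_grid(factors, layer_sets):
    """Cartesian product of candidate downsampling factors and layer selections."""
    return [AblationConfig(f, tuple(layers)) for f, layers in product(factors, layer_sets)]


grid = build_grid(
    factors=[1, 2, 4],
    layer_sets=[(-1,), (-4, -1), (-8, -4, -1)],
)
```

Each entry would correspond to one training run, so a grid of this size already implies nine runs per benchmark before seeds are multiplied in.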

axioms (1)

Domain assumption: standard supervised learning with a video-prediction auxiliary loss improves policy representations
Invoked when claiming that latent-space supervision retains the benefits of full WAM co-training.

invented entities (1)

StateFusionActionExpert (no independent evidence)
purpose: Efficient interface that fuses multi-layer states via learned-query pooling to predict action chunks directly
New module introduced to avoid heavy generative action experts; no independent evidence outside the model itself.
