
arxiv: 2606.08242 · v1 · pith:WBU5L4WAnew · submitted 2026-06-06 · 💻 cs.CV

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Pith reviewed 2026-06-27 19:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords robot manipulation · world action models · efficient policies · latent video supervision · state fusion · action chunk prediction · multi-task learning

The pith

A lightweight world action model achieves competitive robot manipulation performance with 0.44B parameters by supervising future video only in downsampled latent space and using state-fusion action decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

World Action Models add future prediction to robot policy training so that representations capture task-relevant temporal structure. Existing versions rely on large generative networks that raise training cost and slow inference, limiting closed-loop deployment. Light-WAM replaces those heavy components with a compact video backbone whose future-video loss is computed only after downsampling into latent space. A new StateFusionActionExpert then pools adapted states from several backbone layers through learned queries and outputs action chunks in one pass. The resulting system keeps the representation benefit of video co-training while cutting parameter count, latency, and memory enough for practical multi-task use on standard benchmarks.

Core claim

Light-WAM demonstrates that future-video supervision performed in a downsampled latent space, paired with a StateFusionActionExpert that fuses multi-layer states via learned-query pooling, allows a 0.44B-parameter model to retain strong performance on LIBERO and deliver usable multi-task results on RoboTwin 2.0, while reaching 72.03 ms inference latency and 4.1 GiB peak GPU memory.

What carries the argument

The StateFusionActionExpert, which reads adapted states from multiple layers of a compact video backbone, fuses them through learned-query pooling, and directly outputs action chunks in a single forward pass.
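The mechanism described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: fusing layers by concatenating their tokens, the single linear output head, the random stand-in weights, and every name and dimension below (`query_pool`, `decode_action_chunk`, `D`, `N`, `L`, `K`, `H`, `A`) are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_pool(queries, tokens):
    """Cross-attention pooling: each query attends over all tokens.

    queries: K vectors of size D, tokens: N vectors of size D.
    Returns K pooled vectors of size D.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) * scale for t in tokens])
        pooled.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)])
    return pooled


def decode_action_chunk(layer_states, queries, weight, bias, horizon, action_dim):
    """One pass: fuse multi-layer states, pool with queries, project to actions."""
    # Assumption: layers are fused by concatenating their tokens along the token axis.
    tokens = [tok for states in layer_states for tok in states]
    pooled = query_pool(queries, tokens)
    flat = [v for vec in pooled for v in vec]
    out = [dot(row, flat) + b for row, b in zip(weight, bias)]
    return [out[t * action_dim:(t + 1) * action_dim] for t in range(horizon)]


rng = random.Random(0)
D, N, L, K, H, A = 8, 5, 3, 4, 6, 7  # hypothetical sizes
layer_states = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(L)]
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]  # stand-in for learned queries
weight = [[rng.gauss(0, 0.1) for _ in range(K * D)] for _ in range(H * A)]
bias = [0.0] * (H * A)

chunk = decode_action_chunk(layer_states, queries, weight, bias, H, A)
print(len(chunk), len(chunk[0]))  # -> 6 7
```

Because the number of queries is fixed, the cost of the output head does not grow with the number of backbone tokens, which is one plausible reading of why a direct head of this kind is cheaper than a generative action expert.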

If this is right

  • Maintains strong performance on the LIBERO benchmark.
  • Achieves usable multi-task performance on RoboTwin 2.0.
  • Requires only 0.44B trainable parameters.
  • Reaches 72.03 ms inference latency with 4.1 GiB peak GPU memory.
  • Improves training throughput relative to prior heavy generative WAMs.
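To put the latency figure in closed-loop terms, the arithmetic below converts 72.03 ms per forward pass into a replanning rate. This is a back-of-the-envelope sketch: the chunk length of 8 is a hypothetical value chosen for illustration, not a figure from the paper, and the calculation ignores sensing and actuation overhead.

```python
def policy_rate_hz(latency_ms):
    """Upper bound on forward passes per second if inference runs back to back."""
    return 1000.0 / latency_ms


def actions_per_second(latency_ms, chunk_len):
    """Actions produced per second when every action in each chunk is used."""
    return chunk_len * policy_rate_hz(latency_ms)


LATENCY_MS = 72.03  # reported inference latency
CHUNK_LEN = 8       # hypothetical chunk length, not taken from the paper

print(f"replanning rate: {policy_rate_hz(LATENCY_MS):.2f} Hz")
print(f"actions/s at chunk length {CHUNK_LEN}: {actions_per_second(LATENCY_MS, CHUNK_LEN):.1f}")
```

Under these assumptions the model can replan a little under 14 times per second.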

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of video representation learning from direct action prediction could let the same backbone serve multiple robot embodiments without retraining the full model.
  • Downsampling the supervision signal may permit longer prediction horizons or higher frame rates without proportional growth in compute.
  • If the latent-space objective generalizes, similar efficiency gains might appear in other vision-based control domains that currently rely on full generative video models.

Load-bearing premise

That performing future-video supervision only in a downsampled latent space retains the representation-learning benefits of full video co-training without introducing new failure modes on manipulation tasks.
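A toy version of the premise makes the trade-off concrete. The sketch below assumes average pooling as the downsampling operator, single-channel latent grids, and a mean-squared-error objective. None of these choices is confirmed by the page, and the function names are hypothetical.

```python
def avg_pool2d(frame, factor):
    """Average-pool a 2D latent grid (list of rows) by an integer factor."""
    h, w = len(frame), len(frame[0])
    assert h % factor == 0 and w % factor == 0
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [frame[i + di][j + dj] for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def mse(pred, target):
    """Mean squared error between two grids of equal shape."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def latent_video_loss(pred_frames, target_frames, factor):
    """Average per-frame MSE between low-res predictions and pooled targets."""
    losses = [mse(p, avg_pool2d(t, factor)) for p, t in zip(pred_frames, target_frames)]
    return sum(losses) / len(losses)


# Two future frames on a 4x4 latent grid, downsampled by a factor of 2.
frame = [[float(4 * i + j) for j in range(4)] for i in range(4)]
targets = [frame, frame]
perfect = [avg_pool2d(f, 2) for f in targets]

print(len(frame) * len(frame[0]), "->", len(perfect[0]) * len(perfect[0][0]))  # -> 16 -> 4
print(latent_video_loss(perfect, targets, 2))  # -> 0.0
```

A factor of 2 in each spatial dimension cuts the supervised tokens per frame by 4, which is the cost saving. The averaging that produces it also blends neighboring latents, which is where fine contact cues could be lost.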

What would settle it

A controlled experiment in which the identical backbone trained with full-resolution pixel-space video prediction produces markedly higher success rates on LIBERO or RoboTwin 2.0 than the downsampled-latent version would falsify the claim that the efficiency trade-off preserves task performance.
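One way to make "markedly higher" operational is a two-proportion z-test on episode success counts. The sketch below is a generic statistical check, not a procedure from the paper, and the episode counts are hypothetical placeholders.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: pixel-space variant vs. downsampled-latent variant.
z, p = two_proportion_z(450, 500, 425, 500)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With counts like these, a five-point gap over 500 episodes per arm would be statistically detectable. The same gap over 50 episodes per arm would not be, so the number of evaluation rollouts matters as much as the headline rates.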

Figures

Figures reproduced from arXiv: 2606.08242 by Dongzhou Cheng, Jiaqi Wang, Juan Wang, Lingxuan Weng, Shiyue Wang, Xiaoyang Xu, Yibin Wang, Ziang Li.

Figure 1: Overview of Light-WAM. Light-WAM shares an adapted video backbone between video co-training and action prediction. During training, the video branch applies future-video supervision to downsampled latent videos z̄_vid, reducing the token cost of temporal supervision. The action prediction branch runs in both training and inference: it takes the current observation latent z_act and predicts action chunks with…
Figure 2: RoboTwin 2.0 inference efficiency-performance comparison.

…conditions. This setting is more challenging for a lightweight model such as Light-WAM, with only 0.44B trainable parameters and a direct action head rather than large generative action experts. As shown in…
Figure 3: Qualitative analysis. Top: future-video predictions compared with reference rollout frames at t = {+8, +16, +24, +32}. Bottom: learned-query visualizations from the StateFusionActionExpert.

4.6 Qualitative Analysis

Future video visualization. The top row of…
Figure 4: Real-world evaluation. Robot setup and success rates on three dual-arm tasks.

5 Conclusion and Limitations

We presented Light-WAM, a lightweight World Action Model for efficient robot manipulation. By combining a compact video backbone, downsampled latent-space video supervision, and the StateFusionActionExpert, Light-WAM improves the efficiency of both WAM training and inference. Experiments on LIBERO, R…
Figure 5: Additional real-world rollouts. Rollout frames on three dual-arm tasks and future-video predictions compared with ground-truth future frames.
Original abstract

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Light-WAM, a lightweight World Action Model for robot manipulation. It uses a compact video backbone with future-video supervision performed exclusively in a downsampled latent space, which reduces co-training cost while aiming to retain the representation-learning benefits. Action prediction is handled by the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them via learned-query pooling, and predicts action chunks in a single forward pass. The experiments are reported to show strong performance on LIBERO and usable multi-task results on RoboTwin 2.0 with only 0.44B trainable parameters, 72.03 ms inference latency, and 4.1 GiB peak GPU memory.

Significance. If the performance and efficiency claims hold under scrutiny, the work could meaningfully advance deployable closed-loop policies by lowering the computational overhead of world-action models, particularly through the single-pass StateFusionActionExpert design. The explicit identification of the downsampling factor and backbone-layer count as free parameters is a positive step toward transparency.

major comments (3)
  1. [§3] §3 (downsampling for latent video space): the central efficiency claim rests on future-video supervision in the downsampled latent space preserving task-relevant temporal and spatial structure. No ablation or sensitivity analysis is reported on the downsampling factor (listed as a free parameter), leaving open whether fine-grained cues required for contact-rich manipulation are lost.
  2. [Results] Results section (LIBERO and RoboTwin tables): performance is reported as 'strong' and 'usable' without error bars, number of seeds, or explicit baseline comparisons and data-exclusion details. This makes it impossible to assess whether the 0.44B-parameter model reliably matches or exceeds prior WAMs.
  3. [StateFusionActionExpert] StateFusionActionExpert description: the learned-query pooling is presented as the efficient interface, yet no controlled comparison to simpler concatenation or attention baselines is given to establish that this component is necessary for the reported latency and memory figures.
minor comments (2)
  1. [Abstract] Abstract: 'improved training throughput' is stated without a quantitative comparison or reference to a specific baseline.
  2. [Abstract] Notation: the term 'StateFusionActionExpert' appears in the abstract before its definition, which may reduce readability.
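The statistical reporting requested in the second major comment is cheap to produce. As a minimal sketch, assuming one aggregate success rate per training seed, the snippet below computes the mean, sample standard deviation, and standard error. The three rates are hypothetical placeholders, not results from the paper.

```python
import statistics


def summarize_seeds(success_rates):
    """Return mean, sample standard deviation, and standard error across seeds."""
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)  # sample std (n - 1 in the denominator)
    sem = std / len(success_rates) ** 0.5
    return mean, std, sem


rates = [0.90, 0.92, 0.88]  # hypothetical per-seed success rates
mean, std, sem = summarize_seeds(rates)
print(f"{mean:.3f} +/- {std:.3f} (n={len(rates)} seeds, SEM {sem:.3f})")
```

Reporting the seed count alongside the spread lets a reader judge whether a gap between Light-WAM and a baseline exceeds run-to-run variation.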

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [§3] §3 (downsampling for latent video space): the central efficiency claim rests on future-video supervision in the downsampled latent space preserving task-relevant temporal and spatial structure. No ablation or sensitivity analysis is reported on the downsampling factor (listed as a free parameter), leaving open whether fine-grained cues required for contact-rich manipulation are lost.

    Authors: We agree that an explicit sensitivity analysis on the downsampling factor would better substantiate the claim that task-relevant structure is retained for contact-rich tasks. Although the factor is identified as a free parameter, we will add an ablation study in the revised manuscript varying this factor and reporting effects on LIBERO and RoboTwin performance. revision: yes

  2. Referee: [Results] Results section (LIBERO and RoboTwin tables): performance is reported as 'strong' and 'usable' without error bars, number of seeds, or explicit baseline comparisons and data-exclusion details. This makes it impossible to assess whether the 0.44B-parameter model reliably matches or exceeds prior WAMs.

    Authors: We acknowledge that statistical reporting with error bars and seed counts is necessary for assessing reliability. The experiments used multiple seeds; we will revise the results section to include error bars, specify the number of seeds, add explicit baseline comparisons, and clarify data-exclusion details. revision: yes

  3. Referee: [StateFusionActionExpert] StateFusionActionExpert description: the learned-query pooling is presented as the efficient interface, yet no controlled comparison to simpler concatenation or attention baselines is given to establish that this component is necessary for the reported latency and memory figures.

    Authors: The learned-query pooling in StateFusionActionExpert is intended to provide an efficient interface. We did not include direct comparisons in the original submission. We will add controlled ablations against concatenation and attention baselines in the revision to demonstrate its contribution to the reported latency and memory figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on empirical benchmarks


The paper describes an architectural proposal (compact video backbone + downsampled latent supervision + StateFusionActionExpert) whose central claims are validated through training and evaluation on LIBERO and RoboTwin 2.0. No equations, fitted parameters, or self-citations are presented as deriving the reported metrics (0.44B parameters, latency, success rates) by construction. Design choices are justified by efficiency arguments and experimental outcomes rather than definitional equivalence or load-bearing self-citation chains. This is the expected outcome for an empirical systems paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The performance claims rest on standard neural-network training assumptions plus the untested premise that latent-space video prediction preserves task-relevant temporal structure; no new physical entities are introduced.

free parameters (2)
  • downsampling factor for latent video space
    Chosen to reduce supervision cost; value not stated in abstract but directly affects whether representation benefits are retained.
  • number of backbone layers read by StateFusionActionExpert
    Architectural hyperparameter that controls the fusion mechanism; it is chosen as part of the design rather than learned during training.
axioms (1)
  • domain assumption: standard supervised learning with an auxiliary video-prediction loss improves policy representations
    Invoked when claiming that latent-space supervision retains the benefits of full WAM co-training.
invented entities (1)
  • StateFusionActionExpert (no independent evidence)
    purpose: Efficient interface that fuses multi-layer states via learned-query pooling to predict action chunks directly
    New module introduced to avoid heavy generative action experts; no independent evidence outside the model itself.

pith-pipeline@v0.9.1-grok · 5772 in / 1391 out tokens · 15685 ms · 2026-06-27T19:54:07.514232+00:00 · methodology


A Algorithmic Details

Backbone adaptation. Light-WAM uses Wan2.1-T2V-1.3B as the video backbone and keeps the pretrained backbone weights frozen. We ad...