pith. machine review for the scientific record.

arxiv: 2604.06168 · v2 · submitted 2026-04-07 · 💻 cs.CV · cs.RO

Recognition: 2 theorem links · Lean Theorem

Action Images: End-to-End Policy Learning via Multiview Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:47 UTC · model grok-4.3

classification: 💻 cs.CV · cs.RO
keywords: action images · world action models · policy learning · video generation · robot manipulation · zero-shot control · multiview representation

The pith

Translating 7-DoF robot actions into pixel-grounded multiview videos lets the video model itself serve as a zero-shot policy without any separate action head.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that encoding control signals as Action Images—multi-view video frames that depict the robot arm's motion directly in pixel space—turns policy learning into a video generation task. This representation keeps actions fully grounded in the same visual space the backbone already understands, so the model can predict future action images to execute behavior. The same network then handles joint video-action generation, action-conditioned video prediction, and action labeling without extra modules. Because the actions stay interpretable and view-consistent, the approach transfers across environments more readily than low-dimensional token encodings. Results on RLBench and real robots indicate this yields the highest zero-shot success rates among compared world action models.

Core claim

Formulating policy learning as multiview video generation via Action Images produces a unified model in which the pretrained video backbone directly outputs control by generating pixel-grounded action sequences, eliminating the need for a separate policy head while supporting video-action joint tasks under one representation.

What carries the argument

Action Images: multi-view video sequences that are grounded in 2D pixels and explicitly track the 7-DoF robot-arm motion.
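To make the encoding concrete, here is a minimal sketch of how one view of such an action image could be rendered, following the recipe described in Figure 2 below (three semantic 3D points projected into the image and drawn as per-channel Gaussian heatmaps, with gripper openness in the blue background). The quaternion layout, the 5 cm axis offsets, the one-point-per-channel assignment, and the 0.1-scaled background level are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def render_action_image(action, K, T_world_cam, hw=(256, 256), sigma=4.0):
    """One 7-DoF action -> one view's RGB action image (after Figure 2).

    action: {'pos': (3,) array, 'quat': (4,) xyzw, 'grip': float in [0, 1]}
    K: (3, 3) camera intrinsics; T_world_cam: (4, 4) world-to-camera transform.
    """
    H, W = hw
    R = Rotation.from_quat(action["quat"]).as_matrix()
    # Three semantic 3D points: the gripper position, plus points offset
    # along the orientation's normal and up axes.
    pos = np.asarray(action["pos"], dtype=np.float64)
    points = [pos, pos + 0.05 * R[:, 2], pos + 0.05 * R[:, 1]]
    img = np.zeros((H, W, 3), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for ch, p_world in enumerate(points):
        # Project the semantic point into pixel space ...
        p_cam = (T_world_cam @ np.append(p_world, 1.0))[:3]
        u, v = (K @ p_cam)[:2] / p_cam[2]
        # ... and render it as a Gaussian heatmap in its own channel.
        img[..., ch] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    # Blue channel additionally encodes gripper openness in the otherwise
    # low-response background; call once per camera to build the multiview set.
    img[..., 2] = np.maximum(img[..., 2], 0.1 * action["grip"])
    return img
```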

If this is right

  • The video backbone alone can execute policies by predicting the next action image sequence.
  • One model supports control, future video generation, and action labeling without task-specific heads.
  • Pixel grounding improves viewpoint and environment transfer compared with abstract action tokens.
  • Zero-shot success rates exceed prior world action models on RLBench and real robots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling the underlying video backbone should directly improve policy performance without redesigning action modules.
  • The same image-based interface could let non-robot video models be repurposed for control by training only on action-image data.
  • Real-world calibration errors may be easier to debug because failures appear as visible mismatches in the generated action images.

Load-bearing premise

Converting precise 7-DoF joint commands into image sequences must preserve every necessary detail of the motion so that generating those images produces accurate physical actions.
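A direct way to probe this premise, before any policy rollout, is an encode-decode round trip over a grid of actions; any systematic reconstruction error bounds the precision the generated images can ever deliver. The sketch below assumes `encode` and `decode` callables with the interfaces of the Figure 2/Figure 3 procedures (such as the sketches elsewhere on this page); it is a diagnostic, not the paper's evaluation.

```python
import numpy as np

def max_roundtrip_error(actions, encode, decode):
    """Worst-case error after encoding 7-DoF actions to multiview action
    images and decoding them back. `encode` maps an action dict to per-view
    images; `decode` inverts it (both interfaces are assumptions)."""
    pos_errs, grip_errs = [], []
    for a in actions:
        a_hat = decode(encode(a))
        pos_errs.append(np.linalg.norm(np.asarray(a["pos"]) - np.asarray(a_hat["pos"])))
        grip_errs.append(abs(a["grip"] - a_hat["grip"]))
    return max(pos_errs), max(grip_errs)
```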

What would settle it

Run the model on a fine-manipulation task where arm motion is partially occluded in the generated action images; success rate falling below that of a model with an explicit low-dimensional action head would falsify the claim that pixel grounding is sufficient.

Figures

Figures reproduced from arXiv: 2604.06168 by Chuang Gan, Haoyu Zhen, Pengsheng Guo, Qiao Sun, Tsun-Hsuan Wang, Yi-Ling Qiao, Yilin Zhao, Yilun Du, Yuncong Yang, Zixian Gao.

Figure 1: Action Images recasts policy learning as multiview video generation: 7-DoF actions are translated into pixel-grounded action images that explicitly track robot-arm motion, enabling a zero-shot policy directly from a unified video backbone.
Figure 2: Action as image. We convert each 7-DoF robot action into three semantic 3D points (position, normal, and up), project them into image space, and render them as RGB Gaussian heatmaps. The blue channel further encodes gripper openness in the low-response background, producing a pixel-grounded action representation.
Figure 3: Action images decoding. A 2D heatmap point is selected in the main view, lifted to 3D by ray casting and side-view matching, and repeated for all semantic points to recover the original 7-DoF action.
Figure 4: Unified world-action model training. Multi-view video and action latents are packed with text and camera conditions and trained under diverse mask strategies, so that the model observes, for each view, a unified timeline of (robot video → action video); multi-view data are processed with shared weights across views, enabling consistent cross-view learning while preserving per-view conditioning. A minimal sketch of such masking follows the figure list.
Figure 5: Real-world zero-shot rollouts on xArm robot.
Figure 6: Zero-shot video and action-image generation on FR3M [50] rooms.
Figure 7: Action labeling results.
Figure 8: More qualitative robot manipulation results, mainly on grasping tasks across diverse objects and scenes. Panels: input image, video generation results, 3D visualization, and real execution; prompts include "Place the black bowl in the paper box" and "Close the middle drawer".
Figure 9: Camera control results in complex scenes from the π0 website.
Figure 10: Action-conditioned generation results.
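The mask-strategy training in Figure 4 is what lets one backbone serve control, prediction, and labeling. A minimal reading is sketched below under assumed tensor shapes and strategy names: each training step hides part of the packed (robot video → action video) timeline and makes the hidden part the generation target.

```python
import torch

# Which latents each strategy hides (and hence asks the model to generate).
# The strategy names and the all-or-nothing masking granularity are
# illustrative assumptions, not the paper's exact recipe.
STRATEGIES = {
    "joint_generation": {"video", "action"},  # generate both from text/camera conditions
    "video_prediction": {"video"},            # action-conditioned video generation
    "action_labeling":  {"action"},           # infer action images from observed video
}

def apply_mask(video_lat: torch.Tensor, action_lat: torch.Tensor, strategy: str):
    """video_lat, action_lat: (views, frames, tokens, dim) latent tensors."""
    hidden = STRATEGIES[strategy]
    v = torch.zeros_like(video_lat) if "video" in hidden else video_lat
    a = torch.zeros_like(action_lat) if "action" in hidden else action_lat
    return v, a, hidden  # `hidden` marks which parts carry the training loss
```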
Original abstract

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Action Images, a unified world action model that represents 7-DoF robot actions as multiview pixel-grounded images rather than low-dimensional tokens. This formulation turns policy learning into multiview video generation, allowing a pretrained video backbone to function directly as a zero-shot policy without a separate action head or module. The same model supports video-action joint generation, action-conditioned video prediction, and action labeling. Empirical claims include the strongest zero-shot success rates on RLBench and real-robot tasks, plus improved generation quality over prior video-space world models.

Significance. If the core assumption holds, the work offers a promising route to tighter integration between large video models and robot control by making actions interpretable and pixel-grounded. This could improve viewpoint and environment transfer while eliminating auxiliary policy networks. The joint-generation capabilities and reported zero-shot gains on standard benchmarks would constitute a concrete advance over existing WAM approaches that rely on separate action modules.

major comments (2)
  1. [§3] §3 (Method, action-image encoding): The central claim that the video backbone itself acts as a zero-shot policy rests on the assertion that translating continuous 7-DoF poses into multiview pixel images is information-preserving. No reconstruction-error bounds, quantization analysis, or multiview-consistency metrics are provided for the encoding step; without these, it is unclear whether generated images can be read back as precise executable actions without an implicit decoder or calibration step that would contradict the “no separate policy head” statement.
  2. [§4] §4 (Experiments): The abstract and results section assert “strongest zero-shot success rates” on RLBench and real-world evaluations, yet the provided description supplies no baseline details, error bars, data splits, exact success metrics, or statistical significance tests. This absence prevents verification that the reported gains are attributable to the pixel-grounded representation rather than implementation specifics or evaluation choices.
minor comments (2)
  1. [Figure 2 / §3.2] Figure captions and §3.2: Add explicit diagrams or pseudocode showing the exact pixel-to-7-DoF decoding procedure used at inference time so readers can confirm it requires no learned components; a sketch in that spirit follows this list.
  2. [§2] Related-work section: The discussion of prior WAMs could more precisely contrast the proposed multiview action-image representation against token-based or latent-action approaches cited in the text.
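In the spirit of minor comment 1, here is what a fully geometric, learning-free decoder could look like, following the Figure 3 description (main-view heatmap peak, ray casting, side-view matching). The depth-sweep matching rule, channel layout, and gripper threshold are assumptions consistent with the encoding sketch earlier on this page, not the paper's verified procedure.

```python
import numpy as np

def argmax_pixel(heat):
    v, u = np.unravel_index(np.argmax(heat), heat.shape)
    return float(u), float(v)

def backproject(u, v, depth, K, T_world_cam):
    # Pixel + assumed depth -> 3D world point.
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    return (np.linalg.inv(T_world_cam) @ np.append(p_cam, 1.0))[:3]

def project(p_world, K, T_world_cam):
    p_cam = (T_world_cam @ np.append(p_world, 1.0))[:3]
    return (K @ p_cam)[:2] / p_cam[2]

def lift_by_side_view(u, v, side_heat, K, T_main, T_side,
                      depths=np.linspace(0.2, 2.0, 200)):
    # Cast the main-view ray; keep the depth whose reprojection scores
    # highest on the side view's heatmap (a simple matching rule).
    best, best_score = None, -np.inf
    H, W = side_heat.shape
    for d in depths:
        p = backproject(u, v, d, K, T_main)
        us, vs = project(p, K, T_side)
        ui, vi = int(round(us)), int(round(vs))
        if 0 <= ui < W and 0 <= vi < H and side_heat[vi, ui] > best_score:
            best, best_score = p, side_heat[vi, ui]
    return best

def decode_action_image(views, K, T_main, T_side):
    """views: {'main': (H, W, 3), 'side': (H, W, 3)} generated action images.
    Channels 0/1/2 hold the position / normal / up heatmaps."""
    pts = []
    for ch in range(3):
        u, v = argmax_pixel(views["main"][..., ch])
        pts.append(lift_by_side_view(u, v, views["side"][..., ch], K, T_main, T_side))
    pos, normal_pt, up_pt = pts
    z = normal_pt - pos; z /= np.linalg.norm(z)                 # approach axis
    y = up_pt - pos; y = y - (y @ z) * z; y /= np.linalg.norm(y)
    R = np.stack([np.cross(y, z), y, z], axis=1)                # orthonormal rotation
    grip_open = float(np.median(views["main"][..., 2]) > 0.05)  # blue background level
    return {"pos": pos, "rot": R, "grip": grip_open}
```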

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate additional analysis and experimental details where needed.

Point-by-point responses
  1. Referee: [§3] §3 (Method, action-image encoding): The central claim that the video backbone itself acts as a zero-shot policy rests on the assertion that translating continuous 7-DoF poses into multiview pixel images is information-preserving. No reconstruction-error bounds, quantization analysis, or multiview-consistency metrics are provided for the encoding step; without these, it is unclear whether generated images can be read back as precise executable actions without an implicit decoder or calibration step that would contradict the “no separate policy head” statement.

    Authors: We appreciate this observation on the encoding step. The action images encode 7-DoF poses via explicit pixel-grounded markers and trajectories in multiview renders, enabling direct geometric readout of actions from generated pixels without any learned decoder or auxiliary policy module. To address the request for quantitative support, the revised manuscript adds reconstruction-error analysis, quantization bounds, and multiview-consistency metrics in the supplementary material. These confirm that the representation is sufficiently information-preserving for executable actions while preserving the zero-shot policy claim. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and results section assert “strongest zero-shot success rates” on RLBench and real-world evaluations, yet the provided description supplies no baseline details, error bars, data splits, exact success metrics, or statistical significance tests. This absence prevents verification that the reported gains are attributable to the pixel-grounded representation rather than implementation specifics or evaluation choices.

    Authors: We agree that fuller experimental reporting is required for independent verification. The revised manuscript expands Section 4 with complete baseline descriptions, error bars from multiple seeds, data-split specifications, precise success-rate definitions, and statistical significance tests. These additions demonstrate that the reported zero-shot gains are attributable to the Action Images formulation rather than other factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent evaluations.

Full rationale

The paper proposes translating 7-DoF actions into multiview action images as a design choice to enable direct use of a pretrained video backbone for policy execution. This representation is introduced independently, and the zero-shot policy claim is grounded in reported success rates on RLBench and real-world tasks rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps reduce the output to the input by construction; the method remains falsifiable through external benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that video models can handle action images as part of their generation process; no explicit free parameters are listed, and Action Images is introduced as a new representation.

axioms (1)
  • domain assumption: Pretrained video backbones can be directly repurposed for policy generation when actions are encoded as pixel images.
    Invoked in the description of the unified model acting as a zero-shot policy.
invented entities (1)
  • Action Images (no independent evidence)
    purpose: Multi-view pixel-grounded video representation of 7-DoF robot actions for unified generation and control.
    New concept introduced to ground actions in 2D pixels and track arm motion explicitly.

pith-pipeline@v0.9.0 · 5548 in / 1145 out tokens · 46732 ms · 2026-05-10T18:47:59.023029+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

Reference graph

Works this paper leans on

74 extracted references · 42 canonical work pages · cited by 1 Pith paper · 20 internal anchors

  1. [1]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025)

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  3. [3]

    4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

    Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7996–8006 (2024)

  4. [4]

    ReCamMaster: Camera-Controlled Generative Rendering from a Single Video

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

  5. [5]

    Navigation World Models

    Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15791–15801 (2025)

  6. [6]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  7. [7]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

  8. [8]

    Video Generation Models as World Simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators

  9. [9]

    Genie: Generative Interactive Environments

    Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: Forty-first International Conference on Machine Learning (2024)

  10. [10]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)

  11. [11]

    TAPIR: Tracking Any Point with Per-Frame Initialization and Temporal Refinement

    Doersch, C., Yang, Y., Vecerik, M., Gokay, D., Gupta, A., Aytar, Y., Carreira, J., Zisserman, A.: Tapir: Tracking any point with per-frame initialization and temporal refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10061–10072 (2023)

  12. [12]

    Learning Universal Policies via Text-Guided Video Generation

    Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in neural information processing systems 36, 9156–9172 (2023)

  13. [13]

    Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments

    Etukuru, H., Naka, N., Hu, Z., Lee, S., Mehu, J., Edsinger, A., Paxton, C., Chintala, S., Pinto, L., Shafiullah, N.M.M.: Robot utility models: General policies for zero-shot deployment in new environments. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 8275–8283. IEEE (2025)

  14. [14]

    USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

    Fang, J., Zhao, S.: Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719 (2024)

  15. [15]

    Veo: A Text-to-Video Generation System

    Google DeepMind: Veo: a text-to-video generation system. PDF (2025), https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf, accessed: 2026-03-05

  16. [16]

    FlowDreamer: A RGB-D World Model with Flow-Based Motion Representations for Robot Manipulation

    Guo, J., Ma, X., Wang, Y., Yang, M., Liu, H., Li, Q.: Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation. arXiv preprint arXiv:2505.10075 (2025)

  17. [17]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Guo, Y., Shi, L.X., Chen, J., Finn, C.: Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125 (2025)

  18. [18]

    Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

    Gupta, A., Murali, A., Gandhi, D.P., Pinto, L.: Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems 31 (2018)

  19. [19]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  20. [20]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Hu, Y., Guo, Y., Wang, P., Chen, X., Wang, Y.J., Zhang, J., Sreenath, K., Lu, C., Chen, J.: Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803 (2024)

  21. [21]

    π*0.6: A VLA That Learns From Experience

    Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., et al.: π*0.6: a VLA that learns from experience. arXiv preprint arXiv:2511.14759 (2025)

  22. [22]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  23. [23]

    RLBench: The Robot Learning Benchmark & Learning Environment

    James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5(2), 3019–3026 (2020)

  24. [24]

    Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

    Jin, Y., Peng, S., Wang, X., Xie, T., Xu, Z., Yang, Y., Shen, Y., Bao, H., Zhou, X.: Diffuman4d: 4d consistent human view synthesis from sparse-view videos with spatio-temporal diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11047–11057 (2025)

  25. [25]

    CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

    Karaev, N., Makarov, Y., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6013–6022 (2025)

  26. [26]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

  27. [27]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Kim, M.J., Gao, Y., Lin, T.Y., Lin, Y.C., Ge, Y., Lam, G., Liang, P., Song, S., Liu, M.Y., Finn, C., et al.: Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163 (2026)

  28. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  29. [29]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

  30. [30]

    Learning to Act from Actionless Videos through Dense Correspondences

    Ko, P.C., Mao, J., Du, Y., Sun, S.H., Tenenbaum, J.B.: Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576 (2023)

  31. [31]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Lee, J., Duan, J., Fang, H., Deng, Y., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y.R., Lee, S., et al.: Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917 (2025)

  32. [32]

    Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

    Li, C., Krause, A., Hutter, M.: Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100 (2025)

  33. [33]

    Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    Li, P., Chen, Y., Xu, Y., Yang, J., Wu, X., Guo, J., Sun, N., Qian, L., Li, X., Xiao, X., Liu, J., Liu, N., Kong, T., Huang, Y., Wang, L., Tan, T.: Multi-view video diffusion policy: A 3d spatio-temporal-aware video action model (2026), https://arxiv.org/abs/2604.03181

  34. [34]

    Unified Video Action Model

    Li, S., Gao, Y., Sadigh, D., Song, S.: Unified video action model. arXiv preprint arXiv:2503.00200 (2025)

  35. [35]

    BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning

    Li, Y., Luo, Z., Zhang, T., Dai, C., Kanervisto, A., Tirinzoni, A., Weng, H., Kitani, K., Guzek, M., Touati, A., et al.: Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. arXiv preprint arXiv:2511.04131 (2025)

  36. [36]

    SS4D: Native 4D Generative Model via Structured Spacetime Latents

    Li, Z., Zhang, M., Wu, T., Tan, J., Wang, J., Lin, D.: Ss4d: Native 4d generative model via structured spacetime latents. ACM Transactions on Graphics (TOG) 44(6), 1–12 (2025)

  37. [37]

    Video Generators Are Robot Policies

    Liang, J., Tokmakov, P., Liu, R., Sudhakar, S., Shah, P., Ambrus, R., Vondrick, C.: Video generators are robot policies. arXiv preprint arXiv:2508.00795 (2025)

  38. [38]

    LTX Studio

    Lightricks: LTX Studio. Online (2024), https://app.ltx.studio/, accessed: 2026-02

  39. [39]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  40. [40]

    Geometry-Aware 4D Video Generation for Robot Manipulation

    Liu, Z., Li, S., Cousineau, E., Feng, S., Burchfiel, B., Song, S.: Geometry-aware 4d video generation for robot manipulation. arXiv preprint arXiv:2507.01099 (2025)

  41. [41]

    Modeling of Dynamic Systems

    Ljung, L., Glad, T.: Modeling of dynamic systems. Prentice-Hall, Inc. (1994)

  42. [42]

    Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance

    Nakamoto, M., Mees, O., Kumar, A., Levine, S.: Steering your generalists: Improving robotic foundation models via value guidance. arXiv preprint arXiv:2410.13816 (2024)

  43. [43]

    Efficient4D: Fast Dynamic 3D Object Generation from a Single-View Video

    Pan, Z., Yang, Z., Zhu, X., Zhang, L.: Efficient4d: Fast dynamic 3d object generation from a single-view video. arXiv preprint arXiv:2401.08742 (2024)

  44. [44]

    Imitating human behaviour with diffusion models

    Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S.V., Tan, S.Z., Momennejad, I., Hofmann, K., et al.: Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677 (2023)

  45. [45]

    XVII. On a New Geometry of Space

    Plücker, J.: XVII. On a new geometry of space. Philosophical Transactions of the Royal Society of London (155), 725–791 (1865)

  46. [46]

    Survey of Model-Based Reinforcement Learning: Applications on Robotics

    Polydoros, A.S., Nalpantidis, L.: Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems 86(2), 153–173 (2017)

  47. [47]

    The Colosseum: A Benchmark for Evaluating Generalization for Robotic Manipulation

    Pumacay, W., Singh, I., Duan, J., Krishna, R., Thomason, J., Fox, D.: The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191 (2024)

  48. [48]

    ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

    Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: Memory optimizations toward training trillion parameter models. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1–16. IEEE (2020)

  49. [49]

    DreamGaussian4D: Generative 4D Gaussian Splatting

    Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023)

  50. [50]

    Distilled feature fields enable few-shot language-guided manipulation

    Shen, W., Yang, G., Yu, A., Wong, J., Kaelbling, L.P., Isola, P.: Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931 (2023)

  51. [51]

    Text-to-4D Dynamic Scene Generation

    Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., et al.: Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280 (2023)

  52. [52]

    Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

    Sun, Q., Yang, L., Tang, W., Huang, W., Xu, K., Chen, Y., Liu, M., Yang, J., Zhu, H., Wang, Y., et al.: Learning primitive embodied world models: Towards scalable robotic learning. arXiv preprint arXiv:2508.20840 (2025)

  53. [53]

    Dyna, an Integrated Architecture for Learning, Planning, and Reacting

    Sutton, R.S.: Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin 2(4), 160–163 (1991)

  54. [54]

    Octo: An Open-Source Generalist Robot Policy

    Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

  55. [55]

    BridgeData V2: A Dataset for Robot Learning at Scale

    Walke, H.R., Black, K., Zhao, T.Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A.W., Myers, V., Kim, M.J., Du, M., et al.: Bridgedata v2: A dataset for robot learning at scale. In: Conference on Robot Learning. pp. 1723–

  56. [56]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  57. [57]

    VGGT: Visual Geometry Grounded Transformer

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  58. [58]

    Explaining Robot Policies

    Watkins, O., Huang, S., Frost, J., Bhatia, K., Weiner, E., Abbeel, P., Darrell, T., Plummer, B., Saenko, K., Dragan, A.: Explaining robot policies. Applied AI Letters 2(4), e52 (2021)

  59. [59]

    Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)

  60. [60]

    DayDreamer: World Models for Physical Robot Learning

    Wu, P., Escontrela, A., Hafner, D., Abbeel, P., Goldberg, K.: Daydreamer: World models for physical robot learning. In: Conference on robot learning. pp. 2226–2240. PMLR (2023)

  61. [61]

    CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

    Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models. arXiv:2411.18613 (2024)

  62. [62]

    SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

    Xie, Y., Yao, C.H., Voleti, V., Jiang, H., Jampani, V.: Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470 (2024)

  63. [63]

    Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

    Xing, Y., Luo, X., Xie, J., Gao, L., Shen, H., Song, J.: Shortcut learning in generalist robot policies: The role of dataset diversity and fragmentation. arXiv preprint arXiv:2508.06426 (2025)

  64. [64]

    Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

    Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20331–20341 (2024)

  65. [65]

    World Action Models are Zero-shot Policies

    Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., et al.: World action models are zero-shot policies. arXiv preprint arXiv:2602.15922 (2026)

  66. [66]

    4Diffusion: Multi-View Video Diffusion Model for 4D Generation

    Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view video diffusion model for 4d generation. Advances in Neural Information Processing Systems 37, 15272–15295 (2024)

  67. [67]

    Effective Tuning Strategies for Generalist Robot Manipulation Policies

    Zhang, W., Li, Y., Qiao, Y., Huang, S., Liu, J., Dayoub, F., Ma, X., Liu, L.: Effective tuning strategies for generalist robot manipulation policies. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 7255–7262. IEEE (2025)

  68. [68]

    Tora: Trajectory-Oriented Diffusion Transformer for Video Generation

    Zhang, Z., Liao, J., Li, M., Dai, Z., Qiu, B., Zhu, S., Qin, L., Wang, W.: Tora: Trajectory-oriented diffusion transformer for video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2063–2073 (2025)

  69. [69]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 (2024)

  70. [70]

    Tesseract: Learning 4D Embodied World Models

    Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesseract: learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025)

  71. [71]

    Towards Learning a Generalist Model for Embodied Navigation

    Zheng, D., Huang, S., Zhao, L., Zhong, Y., Wang, L.: Towards learning a generalist model for embodied navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13624–13634 (2024)

  72. [72]

    RoboDreamer: Learning Compositional World Models for Robot Imagination

    Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024)

  73. [73]

    Aether: Geometric-Aware Unified World Modeling

    Zhu, H., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Chen, J., Shen, C., Pang, J., He, T.: Aether: Geometric-aware unified world modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8535–8546 (2025)

  74. [74]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)