Recognition: 2 theorem links · Lean Theorem
Action Images: End-to-End Policy Learning via Multiview Video Generation
Pith reviewed 2026-05-10 18:47 UTC · model grok-4.3
The pith
Translating 7-DoF robot actions into pixel-grounded multiview videos lets the video model itself serve as a zero-shot policy without any separate action head.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formulating policy learning as multiview video generation via Action Images produces a unified model in which the pretrained video backbone directly outputs control by generating pixel-grounded action sequences, eliminating the need for a separate policy head while supporting video-action joint tasks under one representation.
What carries the argument
Action Images: multi-view video sequences that are grounded in 2D pixels and explicitly track the 7-DoF robot-arm motion.
If this is right
- The video backbone alone can execute policies by predicting the next action image sequence.
- One model supports control, future video generation, and action labeling without task-specific heads.
- Pixel grounding improves viewpoint and environment transfer compared with abstract action tokens.
- Zero-shot success rates exceed those of prior world action models on RLBench and on real robots.
Where Pith is reading between the lines
- Scaling the underlying video backbone should directly improve policy performance without redesigning action modules.
- The same image-based interface could let non-robot video models be repurposed for control by training only on action-image data.
- Real-world calibration errors may be easier to debug because failures appear as visible mismatches in the generated action images.
Load-bearing premise
Converting precise 7-DoF robot actions into image sequences must preserve every necessary detail of the motion so that generating those images produces accurate physical actions.
What would settle it
Run the model on a fine-manipulation task where arm motion is partially occluded in the generated action images; success rate falling below that of a model with an explicit low-dimensional action head would falsify the claim that pixel grounding is sufficient.
Original abstract
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
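The abstract leaves the rendering step implicit, but the method passage quoted later on this page ("three semantic 3D points... render them as RGB Gaussian heatmaps") suggests a project-and-splat operation per camera view. The sketch below illustrates that idea under assumed pinhole geometry; the function name, image resolution, and kernel width are illustrative choices, not the paper's implementation.

```python
import numpy as np

def render_action_heatmaps(points_3d, K, T_world_to_cam, hw=(256, 256), sigma=4.0):
    """Sketch: splat 3D action keypoints (e.g. gripper points) into one camera
    view as Gaussian heatmaps, one channel per keypoint.

    points_3d: (N, 3) keypoints in the world frame.
    K: (3, 3) camera intrinsics.  T_world_to_cam: (4, 4) extrinsics.
    Returns an (H, W, N) float array in [0, 1].
    """
    H, W = hw
    points_3d = np.asarray(points_3d, dtype=float)
    pts_h = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]      # world -> camera coordinates
    uv = (K @ cam.T).T                             # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                    # divide by depth -> pixel coords

    ys, xs = np.mgrid[0:H, 0:W]                    # pixel grid
    d2 = (xs[..., None] - uv[:, 0]) ** 2 + (ys[..., None] - uv[:, 1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))        # (H, W, N) Gaussian splats
```

Stacking such per-view heatmaps over keypoints, cameras, and timesteps would give a pixel-grounded action sequence that a video backbone can generate alongside ordinary observation frames.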
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Action Images, a unified world action model that represents 7-DoF robot actions as multiview pixel-grounded images rather than low-dimensional tokens. This formulation turns policy learning into multiview video generation, allowing a pretrained video backbone to function directly as a zero-shot policy without a separate action head or module. The same model supports video-action joint generation, action-conditioned video prediction, and action labeling. Empirical claims include the strongest zero-shot success rates on RLBench and real-robot tasks, plus improved generation quality over prior video-space world models.
Significance. If the core assumption holds, the work offers a promising route to tighter integration between large video models and robot control by making actions interpretable and pixel-grounded. This could improve viewpoint and environment transfer while eliminating auxiliary policy networks. The joint-generation capabilities and reported zero-shot gains on standard benchmarks would constitute a concrete advance over existing WAM approaches that rely on separate action modules.
Major comments (2)
- [§3] §3 (Method, action-image encoding): The central claim that the video backbone itself acts as a zero-shot policy rests on the assertion that translating continuous 7-DoF poses into multiview pixel images is information-preserving. No reconstruction-error bounds, quantization analysis, or multiview-consistency metrics are provided for the encoding step; without these, it is unclear whether generated images can be read back as precise executable actions without an implicit decoder or calibration step that would contradict the “no separate policy head” statement. (A back-of-envelope pixel-quantization bound is sketched after this list.)
- [§4] §4 (Experiments): The abstract and results section assert “strongest zero-shot success rates” on RLBench and real-world evaluations, yet the provided description supplies no baseline details, error bars, data splits, exact success metrics, or statistical significance tests. This absence prevents verification that the reported gains are attributable to the pixel-grounded representation rather than implementation specifics or evaluation choices.
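For the encoding concern above, a back-of-envelope pixel-quantization bound is easy to state: rounding a projected keypoint to the pixel grid induces a lateral 3D error of roughly depth × pixel error / focal length per view. The sketch below uses assumed camera values for illustration only; none of the numbers come from the paper.

```python
def lateral_quantization_bound(depth_m, focal_px, pixel_err=0.5):
    """Worst-case lateral 3D error (metres) from rounding a projected keypoint
    to the pixel grid, for a pinhole camera viewing the point at depth_m."""
    return depth_m * pixel_err / focal_px

# Assumed, illustrative values: 0.8 m working depth, 280 px focal length.
print(lateral_quantization_bound(0.8, 280.0))  # ~0.0014 m, i.e. ~1.4 mm per view
```

Sub-pixel heatmap peaks and multiview triangulation would shrink this further; the referee's request is essentially that the paper report the measured analogue of this bound.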
Minor comments (2)
- [Figure 2 / §3.2] Figure captions and §3.2: Add explicit diagrams or pseudocode showing the exact pixel-to-7-DoF decoding procedure used at inference time so readers can confirm it requires no learned components (a geometric decoding sketch in that spirit follows this list).
- [§2] Related-work section: The discussion of prior WAMs could more precisely contrast the proposed multiview action-image representation against token-based or latent-action approaches cited in the text.
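One plausible, purely geometric form of that decoding procedure is linear (DLT) triangulation of each keypoint's heatmap peak across calibrated views. The sketch below is an assumption-laden illustration, not the paper's published algorithm, but it involves no learned components, which is the property the referee asks the authors to document.

```python
import numpy as np

def triangulate_keypoint(uvs, projections):
    """Linear (DLT) triangulation of one action keypoint from >= 2 views.

    uvs: iterable of (u, v) pixel peaks, one per camera view.
    projections: matching 3x4 projection matrices P = K [R | t].
    Returns the 3D point in world coordinates.
    """
    rows = []
    for (u, v), P in zip(uvs, projections):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]                # homogeneous -> Euclidean
```

With the three semantic 3D points quoted elsewhere on this page recovered this way, a 6-DoF end-effector frame plus gripper state can in principle be read off geometrically.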
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate additional analysis and experimental details where needed.
Point-by-point responses
Referee: [§3] §3 (Method, action-image encoding): The central claim that the video backbone itself acts as a zero-shot policy rests on the assertion that translating continuous 7-DoF poses into multiview pixel images is information-preserving. No reconstruction-error bounds, quantization analysis, or multiview-consistency metrics are provided for the encoding step; without these, it is unclear whether generated images can be read back as precise executable actions without an implicit decoder or calibration step that would contradict the “no separate policy head” statement.
Authors: We appreciate this observation on the encoding step. The action images encode 7-DoF poses via explicit pixel-grounded markers and trajectories in multiview renders, enabling direct geometric readout of actions from generated pixels without any learned decoder or auxiliary policy module. To address the request for quantitative support, the revised manuscript adds reconstruction-error analysis, quantization bounds, and multiview-consistency metrics in the supplementary material. These confirm that the representation is sufficiently information-preserving for executable actions while preserving the zero-shot policy claim. Revision: yes.
Referee: [§4] §4 (Experiments): The abstract and results section assert “strongest zero-shot success rates” on RLBench and real-world evaluations, yet the provided description supplies no baseline details, error bars, data splits, exact success metrics, or statistical significance tests. This absence prevents verification that the reported gains are attributable to the pixel-grounded representation rather than implementation specifics or evaluation choices.
Authors: We agree that fuller experimental reporting is required for independent verification. The revised manuscript expands Section 4 with complete baseline descriptions, error bars from multiple seeds, data-split specifications, precise success-rate definitions, and statistical significance tests. These additions demonstrate that the reported zero-shot gains are attributable to the Action Images formulation rather than other factors. Revision: yes.
Circularity Check
No significant circularity; empirical claims rest on independent evaluations
Full rationale
The paper proposes translating 7-DoF actions into multiview action images as a design choice to enable direct use of a pretrained video backbone for policy execution. This representation is introduced independently, and the zero-shot policy claim is grounded in reported success rates on RLBench and real-world tasks rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps reduce the output to the input by construction; the method remains falsifiable through external benchmarks without circular reduction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Pretrained video backbones can be directly repurposed for policy generation when actions are encoded as pixel images.
Invented entities (1)
- Action Images (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We convert each 7-DoF robot action into three semantic 3D points... render them as RGB Gaussian heatmaps... unified video-space representation of observation and action."
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Paper passage: "multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion... zero-shot policy without a separate policy head"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
  EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.