GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Ao Qu; Fangneng Zhan; Grace Chen; Hang Hua; Kaichen Zhou; Mengyu Wang; Paul Pu Liang; Xinhai Chang; Yilun Du; Yuzhen Chen; Zhuang Liu

arxiv: 2605.22882 · v1 · pith:MO3QYE7Inew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang

Pith reviewed 2026-05-25 05:32 UTC · model grok-4.3

keywords: video world models · robot manipulation · 4D correspondence · geometric consistency · inverse dynamics · video prediction · point-level motion

The pith

Injecting dense 4D correspondence supervision into video world models preserves point-level motion consistency and raises real-world robot manipulation success from 61% to 81%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video world models produce plausible futures yet fail at consistent point motion over time, which breaks the physical grounding needed to turn generated videos into reliable robot actions. GEM-4D fixes this by distilling dense 4D correspondences from a pretrained geometry model and using them as supervision inside the video generative backbone, allowing the model to learn both appearance and geometry together. The architecture stays single-stream, so there is no added cost when generating videos at inference time. An added inverse dynamics module then converts the geometrically consistent video sequences directly into robot trajectories that can be executed in simulation or on real hardware.

Core claim

GEM-4D resolves the limitation of inconsistent point-level motion in video world models by distilling dense 4D correspondences from a pretrained geometry foundation model and injecting them as supervision during training of the video generative backbone. This enables joint capture of appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. An inverse dynamics module converts the correspondence-consistent video rollouts into executable robot trajectories. The resulting model reaches state-of-the-art performance on video prediction and geometric consistency in both simulated and realistic settings and improves the real-world manipulation success rate from 61% to 81%.

What carries the argument

Dense 4D correspondence supervision distilled from a pretrained geometry foundation model and injected into the video generative backbone during training.
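
The coupled training described in the Figure 2 caption (a video DiT predicting the velocity of a noised video latent, with its intermediate features guiding a geometry DiT that predicts geometry velocity) can be sketched roughly as follows. This is a toy illustration and not the authors' implementation: the linear noising path, the timestep shared by both branches, the `weight` factor, and every name here (`noised_sample`, `joint_loss`, `toy_video_dit`, `toy_geometry_dit`) are assumptions introduced for the example. The paper's actual loss form and weighting are not stated in this summary.

```python
import random

def noised_sample(clean, t, rng):
    """Interpolate between a clean latent and Gaussian noise (assumed linear path)."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * x + t * n for x, n in zip(clean, noise)]
    velocity = [n - x for x, n in zip(clean, noise)]  # regression target
    return x_t, velocity

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def joint_loss(video_latent, geometry_latent, video_dit, geometry_dit, weight, rng):
    """Video velocity loss plus a weighted geometry velocity loss.

    video_dit(x_t, t) -> (velocity, features)        # hypothetical interface
    geometry_dit(g_t, t, features) -> velocity       # hypothetical interface
    """
    t = rng.random()  # assumption: both branches share one timestep
    v_t, v_target = noised_sample(video_latent, t, rng)
    g_t, g_target = noised_sample(geometry_latent, t, rng)
    v_pred, features = video_dit(v_t, t)
    g_pred = geometry_dit(g_t, t, features)  # conditioned on video features
    return mse(v_pred, v_target) + weight * mse(g_pred, g_target)

# Toy stand-ins so the sketch runs; real models would be transformers.
def toy_video_dit(x_t, t):
    return [0.0 for _ in x_t], list(x_t)

def toy_geometry_dit(g_t, t, features):
    return [0.0 for _ in g_t]
```

In a real training setup, the geometry term would presumably influence the video backbone through the shared features. At inference only the video branch would be called, which matches the caption's statement that generation uses the video branch alone.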

If this is right

The model jointly captures appearance and geometric structure from the injected supervision.
Single-stream design means generated videos incur no extra compute at inference time.
Inverse dynamics module converts consistent video rollouts into executable trajectories for direct real-world and simulated use.
Performance reaches state-of-the-art on video prediction and geometric consistency metrics across simulation and realistic scenarios.
Real-world manipulation success rate increases from 61% to 81%.
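
How much weight the 61% to 81% gain can bear depends on the number of trials, which this summary does not report. The sketch below uses the standard Wilson score interval to show how the picture changes with sample size. The trial counts (20 and 100) are purely hypothetical and are not taken from the paper; `wilson_interval` and `intervals_overlap` are names introduced here.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical trial counts; the paper's real counts are not given here.
for trials in (20, 100):
    baseline = wilson_interval(round(0.61 * trials), trials)
    gem4d = wilson_interval(round(0.81 * trials), trials)
    print(trials, baseline, gem4d, intervals_overlap(baseline, gem4d))
```

Under these made-up counts the two intervals overlap at 20 trials and separate at 100. Non-overlap is a conservative heuristic rather than a formal significance test, so this only illustrates why the trial count matters.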

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same correspondence supervision could be applied to other video-based planning domains such as navigation or object rearrangement where long-term point consistency matters.
Because the architecture change is confined to training, the approach may transfer to existing video model checkpoints with modest additional training.
If the geometry foundation model itself improves, GEM-4D performance would improve without any change to the video backbone.
The method suggests that explicit geometric signals can substitute for architectural complexity when building world models for physical tasks.

Load-bearing premise

The dense 4D correspondences supply accurate, unbiased supervision that directly improves physical consistency for manipulation without introducing new failure modes.

What would settle it

Run GEM-4D and the baseline on the same manipulation tasks and measure both point drift in the generated videos and end-to-end success rates. If GEM-4D shows equal or greater drift than the baseline, or a success rate below the baseline's 61%, the claim is falsified.
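
One simple way to quantify point drift is the mean Euclidean distance per frame between tracked 3D points in a generated video and the same points in a reference video. The sketch below assumes matched point tracks are already available from some external tracker. The function name `mean_point_drift` and the track layout are assumptions made for this example and are not the paper's metric.

```python
import math

def mean_point_drift(predicted_tracks, reference_tracks):
    """Per-frame mean distance between matched 3D points.

    tracks[f][k] is the (x, y, z) position of point k in frame f.
    Both inputs are assumed to have the same frames and point ordering.
    """
    drift = []
    for pred_frame, ref_frame in zip(predicted_tracks, reference_tracks):
        dists = [math.dist(p, r) for p, r in zip(pred_frame, ref_frame)]
        drift.append(sum(dists) / len(dists))
    return drift
```

A drift curve that grows with the frame index would indicate accumulating geometric inconsistency. Comparing the curves for GEM-4D and the baseline on identical inputs would address the first half of the test.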

Figures

Figures reproduced from arXiv: 2605.22882 by Ao Qu, Fangneng Zhan, Grace Chen, Hang Hua, Kaichen Zhou, Mengyu Wang, Paul Pu Liang, Xinhai Chang, Yilun Du, Yuzhen Chen, Zhuang Liu.

**Figure 1.** Teaser. Given an instruction and an initial observation, our model predicts future frames while preserving geometric consistency. Compared with the baseline (left), our approach produces more realistic and structurally coherent scene evolution.

**Figure 2.** GEM-4D. During training, a video DiT predicts the velocity of the noised video latent, while its intermediate features guide a geometry DiT to predict geometry velocity. This coupled training enforces geometry-consistent generation. During inference, only the video branch is used for efficient generation.

**Figure 3.** Adaptive Inverse Dynamic System. Given a generated video as input, this system extracts a robot policy through the four steps illustrated in the figure.

**Figure 4.** Generated Frames to Arm Action. From an initial observation, through GEM-4D-predicted future frames, to executed UF arm actions.

**Figure 5.** Qualitative 4D scene generation results on Droid (real) and RLBench

**Figure 6.** Real-robot rollouts. From left: ground-truth video, GEM-4D-generated RGB, and the back-projected 3D point cloud. The model produces realistic and geometrically coherent rollouts under unseen backgrounds, supporting transfer to UF Arm manipulation.

Original abstract

Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at the project page: https://anonymous-submission-20.github.io/gem.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces GEM-4D, a single-stream video world model that injects dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the generative backbone. This supervision is intended to enforce point-level motion consistency and physical grounding without extra inference cost. An inverse dynamics module then maps the resulting correspondence-consistent video rollouts to executable robot actions. The manuscript reports state-of-the-art results on video prediction and geometric-consistency metrics across simulation and real scenarios, together with an improvement in real-world manipulation success rate from 61% to 81%.

Significance. If the reported gains hold under the experimental protocols detailed in the full manuscript, the work shows that external geometric supervision can measurably improve the downstream utility of video world models for manipulation while preserving architectural simplicity. The inclusion of ablations on the supervision signal and quantitative 3D point-tracking metrics provides concrete evidence that the geometric component contributes to the observed performance lift.

minor comments (2)

[Abstract] The headline performance numbers (61% to 81% success) are presented without trial counts, baseline methods, or statistical tests; a one-sentence summary of the evaluation protocol would improve readability.
The manuscript should explicitly state the precise form of the 4D correspondence loss (e.g., whether it is an L2 distance on 3D points or a reprojection error) and its weighting relative to the standard video-generation objective.
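
The distinction raised in the second comment is not cosmetic. The two candidate forms penalize different errors, as the sketch below illustrates: an L2 loss on 3D points penalizes errors along the camera ray, while a reprojection loss under a pinhole camera does not. The intrinsics and the function names (`l2_point_loss`, `project`, `reprojection_loss`) are hypothetical, and neither function is claimed to be the loss the paper uses.

```python
def l2_point_loss(pred_points, target_points):
    """Mean squared Euclidean distance between matched 3D points."""
    total = 0.0
    for p, q in zip(pred_points, target_points):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(target_points)

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_loss(pred_points, target_pixels, fx, fy, cx, cy):
    """Mean squared pixel distance after projecting the predicted 3D points."""
    total = 0.0
    for p, (u, v) in zip(pred_points, target_pixels):
        pu, pv = project(p, fx, fy, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(target_pixels)
```

For example, a predicted point at depth 2 on the optical axis and a target at depth 1 on the same axis project to the same pixel, so the reprojection loss is zero while the L2 loss is not. Which form is used therefore determines whether depth errors are supervised at all.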

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of GEM-4D and the recommendation for minor revision. We are encouraged that the role of 4D correspondence supervision in improving geometric consistency and downstream manipulation performance is recognized.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contribution injects dense 4D correspondence supervision distilled from an external pretrained geometry foundation model into the video generative backbone. This supervision source is independent of the target model and its outputs. The inverse dynamics module is presented as an additional component that converts the resulting rollouts into trajectories. No equations or claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; empirical ablations and real-world success rates are reported as external validation rather than tautological results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that an external geometry foundation model supplies reliable 4D correspondences; no free parameters or invented entities are stated in the abstract.

axioms (1)

Domain assumption: A pretrained geometry foundation model produces dense 4D correspondences that are accurate enough to supervise physical motion in manipulation videos.
The method uses these correspondences as a training signal; the abstract does not provide independent verification of their accuracy for the target domain.
