GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Ao Qu; Fangneng Zhan; Grace Chen; Hang Hua; Kaichen Zhou; Mengyu Wang; Paul Pu Liang; Xinhai Chang; Yilun Du; Yuzhen Chen

arxiv: 2605.22882 · v3 · pith:MO3QYE7Inew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Kaichen Zhou , Yuzhen Chen , Fangneng Zhan , Hang Hua , Grace Chen , Xinhai Chang , Ao Qu , Yilun Du

show 3 more authors

Zhuang Liu Paul Pu Liang Mengyu Wang

This is my paper

Pith reviewed 2026-06-30 16:42 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords video world modelsrobot manipulation4D correspondencegeometric consistencyinverse dynamicsvideo predictionworld models for robotics

0 comments

The pith

Injecting dense 4D correspondence supervision into video world models produces geometrically consistent futures that support reliable robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the gap between visually plausible video generation and physical usability by adding 4D point tracking signals distilled from a geometry model during training. Standard video world models generate futures that look right but lose track of the same object points over time, so they cannot reliably drive robot actions. The added supervision lets the model learn both appearance and geometry inside one network, after which an inverse dynamics head turns the resulting video sequences into executable trajectories. A sympathetic reader would care because this turns generative video into a practical planning tool for manipulation instead of a separate prediction step.

Core claim

GEM-4D resolves the inconsistency in video world models by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. An inverse dynamics module then converts the correspondence-consistent video rollouts into executable robot trajectories. The approach achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and raises real-world manipulation success from 61% to

What carries the argument

Dense 4D correspondence supervision distilled from a pretrained geometry foundation model, injected into the video generative backbone to enforce point tracking across frames.

If this is right

Generated videos maintain consistent 4D point identities across time steps.
The inverse dynamics module produces trajectories that can be executed directly on real robots.
Video prediction and geometric consistency metrics both reach state-of-the-art levels in simulation and real data.
Real-world manipulation success improves from 61% to 81% with no added inference cost.
The single-stream design keeps the model deployable without extra compute at test time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation technique may improve other generative models that must support physical interaction beyond manipulation.
Inconsistency in long-horizon rollouts is likely to grow without this form of geometric anchoring.
If the geometry foundation model improves independently, the video world model can inherit better consistency without retraining from scratch.

Load-bearing premise

The 4D correspondences distilled from the pretrained geometry model remain accurate enough across domains to guide the video model without systematic errors that would produce invalid robot trajectories.

What would settle it

Running the trained GEM-4D pipeline on manipulation sequences where the geometry foundation model is independently shown to output inaccurate 4D tracks, such as under heavy occlusion, and measuring whether manipulation success stays at or above 81%.

Figures

Figures reproduced from arXiv: 2605.22882 by Ao Qu, Fangneng Zhan, Grace Chen, Hang Hua, Kaichen Zhou, Mengyu Wang, Paul Pu Liang, Xinhai Chang, Yilun Du, Yuzhen Chen, Zhuang Liu.

**Figure 1.** Figure 1: Teaser. Given an instruction and an initial observation, our model predicts future frames while preserving geometric consistency. Compared with the baseline (left), our approach produces more realistic and structurally coherent scene evolution. Abstract. Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. A… view at source ↗

**Figure 2.** Figure 2: GEM-4D During training, a video DiT predicts the velocity of the noised video latent, while its intermediate features guide a geometry DiT to predict geometry velocity. This coupled training enforces geometry-consistent generation. During inference, only the video branch is used for efficient generation. 3 GEM-4D 3.1 Problem Formulation Given an initial observation I0 and a language instruction c, we lear… view at source ↗

**Figure 3.** Figure 3: Adaptive Inverse Dynamic System. Given a generated video as input, this system extracts a robot policy through the four steps illustrated in the figure. Initial Generated RGB Frames UF Robot Arm Action [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Generated Frames to Arm Action. From an initial observation, through GEM-4D-predicted future frames, to executed UF arm actions. throughout; it serves as a correspondence teacher whose knowledge we distill into the video backbone. To perform this distillation, we introduce a parallel flowmatching process over the geometry representation space. A Geometry DiT, conditioned on the video backbone’s intermedi… view at source ↗

**Figure 4.** Figure 4: Generated Frames to Arm Action. From an initial observation, through GEM-4D-predicted future frames, to executed UF arm actions. Geometry–kinematics pose fallback. Given the mask of end effector Mt ee, the frame It , the depth Dt, and the EE CAD model, FoundationPose [55] predicts the pose of EE (Rt ee, Tt ee) together with a confidence κt ∈ [0, 1] for each frame. (\mathbf {R}_{\text {ee}}^{t},\mathbf {T}_… view at source ↗

**Figure 5.** Figure 5: Qualitative 4D scene generation results on Droid (real) and RLBench [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Real-robot rollouts. From left: ground-truth video, GEM-4D-generated RGB, and the back-projected 3D point cloud. The model produces realistic and geometrically coherent rollouts under unseen backgrounds, supporting transfer to UF Arm manipulation [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

read the original abstract

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across both simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GEM-4D distills 4D correspondences from a geometry model into a single-stream video backbone and adds an inverse dynamics module, claiming SOTA consistency and a real-world manipulation lift from 61% to 81%, but the causal role of the supervision needs verification on target data.

read the letter

The main takeaway is that GEM-4D injects dense 4D point correspondence supervision, taken from a pretrained geometry foundation model, into video world model training to improve physical point tracking across frames. They keep a single-stream architecture so inference cost stays the same, then attach an inverse dynamics module that turns the consistent video rollouts into robot actions for manipulation tasks. The abstract reports state-of-the-art numbers on both prediction quality and geometric consistency in sim and real settings, plus the 20-point success rate gain.

What stands out as new is the concrete choice to distill those 4D tracks as a training signal rather than adding separate geometry branches or post-processing. The paper does a reasonable job framing the consistency problem as a blocker for closed-loop control and showing how the supervision can be added without complicating deployment.

The soft spots sit mostly with the quality and domain fit of the distilled correspondences. The stress-test note correctly flags that if the foundation model produces higher error on dynamic, occluded, or lighting-variable manipulation scenes than on its original training data, the added supervision could introduce systematic bias instead of fixing tracking. The abstract gives no error rates, ablations, or direct comparisons of track accuracy on the real robot data, so it is not yet clear how much of the 81% figure traces back to this mechanism versus other training choices. If the full paper supplies those checks and shows the gains survive when the supervision source is swapped or ablated, the central claim strengthens; otherwise the attribution stays provisional.

This is aimed at people working on video world models for robotics and embodied AI. A reader already thinking about grounding generative predictions for planning would pick up usable ideas on supervision and the inverse dynamics step.

The work shows clear engagement with the consistency issue and builds on external pretrained models in a reproducible way, so it merits peer review even if revisions are needed on the evaluation details. I would send it out rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper presents GEM-4D, a single-stream video world model that injects dense 4D correspondence supervision distilled from a pretrained geometry foundation model during training to enforce physical consistency in generated videos. An inverse-dynamics module then converts the correspondence-consistent rollouts into robot trajectories. The central claims are SOTA video prediction and geometric consistency on simulation and real data, plus an improvement in real-world manipulation success from 61% to 81%.

Significance. If the distilled 4D supervision is shown to be accurate on the target distribution, the approach would offer a practical way to add geometric grounding to video world models without extra inference cost, directly addressing inconsistent point tracking that currently limits their use in manipulation.

major comments (2)

[Methods (distillation procedure) and Experiments (real-world evaluation)] The performance gain (61% → 81%) and SOTA geometric-consistency claims rest on the assumption that the distilled 4D correspondences remain accurate on real manipulation scenes. No quantitative error analysis, occlusion/lighting ablation, or comparison against ground-truth tracks on the real-world test distribution is provided to rule out systematic domain-gap errors that would make the supervision degrade rather than improve consistency.
[Experiments (real-world manipulation results)] The inverse-dynamics module is described as converting correspondence-consistent videos into trajectories, yet no ablation isolates whether the reported success-rate lift is caused by the 4D supervision versus other factors (e.g., training data scale, backbone choice, or inverse-dynamics architecture). Without this isolation the attribution to geometric consistency cannot be verified.

minor comments (1)

The abstract states that additional results are available at gem-4d.github.io; the manuscript should contain enough detail on the distillation loss, correspondence sampling strategy, and inverse-dynamics training so that the core claims can be assessed without external links.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with point-by-point responses, providing the strongest honest defense of the manuscript while acknowledging areas where additional clarification or experiments can strengthen the work.

read point-by-point responses

Referee: [Methods (distillation procedure) and Experiments (real-world evaluation)] The performance gain (61% → 81%) and SOTA geometric-consistency claims rest on the assumption that the distilled 4D correspondences remain accurate on real manipulation scenes. No quantitative error analysis, occlusion/lighting ablation, or comparison against ground-truth tracks on the real-world test distribution is provided to rule out systematic domain-gap errors that would make the supervision degrade rather than improve consistency.

Authors: We acknowledge the referee's valid concern about potential domain gap in the distilled 4D correspondences. The manuscript does not include a direct quantitative error analysis or ground-truth track comparison on the real-world test distribution, as obtaining dense 4D ground truth for real manipulation scenes is resource-intensive and not standard practice. However, the pretrained geometry foundation model is applied to the target distribution during distillation, and the reported SOTA geometric consistency metrics on real data (along with the 20-point real-world success improvement) provide supporting evidence that the supervision enhances rather than degrades performance. We will add a quantitative distillation error analysis on synthetic data with known ground truth, plus an occlusion/lighting ablation, to the revised manuscript. revision: yes
Referee: [Experiments (real-world manipulation results)] The inverse-dynamics module is described as converting correspondence-consistent videos into trajectories, yet no ablation isolates whether the reported success-rate lift is caused by the 4D supervision versus other factors (e.g., training data scale, backbone choice, or inverse-dynamics architecture). Without this isolation the attribution to geometric consistency cannot be verified.

Authors: We agree that isolating the contribution of 4D supervision is important for attribution. The manuscript's main real-world results compare the full GEM-4D model (with 4D supervision) against a baseline video world model that uses the identical inverse-dynamics module, backbone, and training data scale but lacks the 4D component; the 61% to 81% lift is therefore attributable to the geometric supervision under otherwise controlled conditions. To further strengthen this, we will include an explicit ablation that varies only the presence of 4D supervision while freezing all other factors including the inverse-dynamics architecture. revision: yes

Circularity Check

0 steps flagged

No circularity; supervision from external pretrained model

full rationale

The paper's central mechanism injects dense 4D correspondence supervision distilled from a pretrained geometry foundation model (external to this work) into the video generative backbone. This is independent input, not defined in terms of the model's outputs or fitted parameters. No equations, predictions, or uniqueness claims reduce by construction to self-citations, ansatzes, or renamed known results. The inverse dynamics module converts the resulting consistent videos to trajectories without evidence of tautological fitting. This matches the default expectation of a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be identified from the text.

pith-pipeline@v0.9.1-grok · 5736 in / 1096 out tokens · 13398 ms · 2026-06-30T16:42:08.556924+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
cs.CV 2026-06 unverdicted novelty 7.0

MemoBench curates 360 ground-truth clips and an evaluation suite to diagnose memory consistency failures in video models when objects change state while out of view.
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
cs.CV 2026-06 unverdicted novelty 7.0

MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
cs.CV 2026-06 unverdicted novelty 7.0

MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
cs.CV 2026-06 unverdicted novelty 6.0

MemoBench curates 360 clips and an evaluation suite to test video models on recovering updated object states after disappear-and-reappear in changing environments.
Bridge-WA: Predicting Where and How the World Changes for Robotic Action
cs.RO 2026-07 unverdicted novelty 4.0

Bridge-WA introduces a lightweight distillation-based world-action model that uses future-change priors to improve robotic task success and robustness without deployment-time dense rollouts.

Reference graph

Works this paper leans on

74 extracted references · 44 canonical work pages · cited by 2 Pith papers · 26 internal anchors

[1]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., Kirmani, S.: Gen2act: Human video gen- eration in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

In: European Conference on Computer Vision (ECCV) (2024)

Bharadhwaj, H., Mottaghi, R., Gupta, A., Tulsiani, S.: Track2act: Predict- ing point tracks from internet videos enables generalizable robot manipula- tion. In: European Conference on Computer Vision (ECCV) (2024)

2024
[3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Björck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: pi0: A vision-language-action flowmodelforgeneralrobotcontrol.arXivpreprintarXiv:2410.24164(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.:Stable video dif- fusion:Scalinglatentvideodiffusionmodelstolargedatasets.arXivpreprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K.,Herzog, A., Hsu, J., et al.:Rt-1:Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

In: Forty-first International Conference on Machine Learning (2024)

Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Gener- ative interactive environments. In: Forty-first International Conference on Machine Learning (2024)

2024
[8]

AdvancesinNeuralInformationProcessingSystems37,24081–24125(2024)

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. AdvancesinNeuralInformationProcessingSystems37,24081–24125(2024)

2024
[9]

Large Video Planner Enables Generalizable Robot Control

Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables gen- eralizable robot control. arXiv preprint arXiv:2512.15840 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Chen, X., Chen, Y., Xiu, Y., Geiger, A., Chen, A.: Easi3r: Estimating disentangled motion from dust3r without training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9158–9168 (2025)

2025
[11]

Mind: Unified visual imagination and control via hierarchical world models.arXiv preprint arXiv:2506.18897, 2025

Chi, X., Ge, K., Liu, J., Zhou, S., Jia, P., He, Z., Liu, Y., Li, T., Han, L., Han, S., et al.: Mind: Learning a dual-system world model for real-time planning and implicit risk analysis. arXiv preprint arXiv:2506.18897 (2025)

work page arXiv 2025
[12]

arXiv preprint arXiv:2410.15461 (2024) 16 K

Chi, X., Zhang, H., Fan, C.K., Qi, X., Zhang, R., Chen, A., Chan, C.m., Xue, W., Luo, W., Zhang, S., et al.: Eva: An embodied world model for future video anticipation. arXiv preprint arXiv:2410.15461 (2024) 16 K. Zhou et al

work page arXiv 2024
[13]

Doersch, C., Gupta, A., Markeeva, L., Recasens, A., Smaira, L., Aytar, Y., Carreira, J., Zisserman, A., Yang, Y.: TAP-Vid: A benchmark for tracking anypointinavideo.In:AdvancesinNeuralInformationProcessingSystems (NeurIPS), Datasets and Benchmarks Track (2022)

2022
[14]

Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson

Du, Y., Yang, M., Florence, P., Xia, F., Wahid, A., Ichter, B., Sermanet, P., Yu, T., Abbeel, P., Tenenbaum, J.B., et al.: Video language planning. arXiv preprint arXiv:2310.10625 (2023)

work page arXiv 2023
[15]

Advances in neural information processing systems36, 9156–9172 (2023)

Du,Y.,Yang,S.,Dai,B.,Dai,H.,Nachum,O.,Tenenbaum,J.,Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in neural information processing systems36, 9156–9172 (2023)

2023
[16]

In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

2014
[17]

In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3d ob- ject reconstruction from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 605–613 (2017)

2017
[18]

Vidarc: Embodied video diffusion model for closed-loop control.arXiv preprint arXiv:2512.17661, 2025

Feng, Y., Xiang, C., Mao, X., Tan, H., Zhang, Z., Huang, S., Zheng, K., Liu, H., Su, H., Zhu, J.: Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661 (2025)

work page arXiv 2025
[19]

Learning video generation for robotic manipulation with collaborative trajectory control,

Fu, X., Wang, X., Liu, X., Bai, J., Xu, R., Wan, P., Zhang, D., Lin, D.: Learning video generation for robotic manipulation with collaborative tra- jectory control. arXiv preprint arXiv:2506.01943 (2025)

work page arXiv 2025
[20]

Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Guo, Y., Shi, L.X., Chen, J., Finn, C.: Ctrl-world: A controllable genera- tive world model for robot manipulation. arXiv preprint arXiv:2510.10125 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

arXiv preprint arXiv:2511.19971 (2025)

Hu, Y., Cheng, C., Yu, S., Guo, X., Wang, H.: Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction. arXiv preprint arXiv:2511.19971 (2025)

work page arXiv 2025
[22]

Skill-aware diffusion for generalizable robotic manipulation.arXiv preprint arXiv:2601.11266, 2026

Huang, A., Chen, J., Cheng, J., Song, R., Pan, W., Zhang, W.: Skill- aware diffusion for generalizable robotic manipulation. arXiv preprint arXiv:2601.11266 (2026)

work page arXiv 2026
[23]

Huang, Y.-W

Huang, W., Chao, Y.W., Mousavian, A., Liu, M.Y., Fox, D., Mo, K., Fei-Fei, L.: Pointworld: Scaling 3d world models for in-the-wild robotic manipula- tion. arXiv preprint arXiv:2601.03782 (2026)

work page arXiv 2026
[24]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridg- ing the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learn- ing benchmark & learning environment. IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

2020
[26]

Karaev, N., Makarov, Y., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.In:ProceedingsoftheIEEE/CVFInternationalConferenceonCom- puter Vision. pp. 6013–6022 (2025)

2025
[27]

In: European conference on computer vision

Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: European conference on computer vision. pp. 18–35. Springer (2024) GEM-4D 17

2024
[28]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karam- cheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[31]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

2023
[32]

In: European Conference on Computer Vision

Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with mast3r. In: European Conference on Computer Vision. pp. 71–91. Springer (2024)

2024
[33]

arXiv preprint arXiv:????.???? (2025)

Li, Z., Wu, P., Han, X., Cai, R., Du, Y.: 4d latent world model for robot planning. arXiv preprint arXiv:????.???? (2025)

2025
[34]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Robotransfer: Geometry-consistent video diffusion for robotic visual policy transfer.arXiv preprint arXiv:2505.23171, 2025

Liu, L., Wang, X., Zhao, G., Li, K., Qin, W., Zhu, J., Qiu, J., Zhu, Z., Huang, G., Su, Z.: Robotransfer: Controllable geometry-consistent video diffusion for manipulation policy transfer. arXiv preprint arXiv:2505.23171 (2025)

work page arXiv 2025
[37]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al.: Sora: A review on background, technol- ogy, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Geometry-aware 4D Video Generation for Robot Manipulation

Liu, Z., Li, S., Cousineau, E., Feng, S., Burchfiel, B., Song, S.: Geometry-aware 4d video generation for robot manipulation. arXiv preprint arXiv:2507.01099 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Graspgen: A diffusion-based framework for 6-dof grasping with on- generator training,

Murali,A.,Sundaralingam,B.,Chao,Y.W.,Yamada,J.,Yuan,W.,Carlson, M., Ramos, F., Birchfield, S., Fox, D., Eppner, C.: GraspGen: A diffusion- based framework for 6-DOF grasping with on-generator training. arXiv preprint arXiv:2507.13097 (2025)

work page arXiv 2025
[40]

In: Pro- ceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Pro- ceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

2023
[41]

Shang, Y ., Zhang, X., Tang, Y ., Jin, L., Gao, C., Wu, W., and Li, Y

Qi, H., Yin, H., Du, Y., Yang, H.: Strengthening generative robot policies through predictive world modeling. arXiv preprint arXiv:2502.00622 (2025)

work page arXiv 2025
[42]

arXiv preprint arXiv:2510.07313 (2025) 2, 4

Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manip- ulation. arXiv preprint arXiv:2510.07313 (2025) 18 K. Zhou et al

work page arXiv 2025
[43]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

In: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Tech- niques (SIGGRAPH)

Shoemake, K.: Animating rotation with quaternion curves. In: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Tech- niques (SIGGRAPH). pp. 245–254. ACM (1985)

1985
[45]

Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,

Tao, S., Xiang, F., Shukla, A., Qin, Y., Hinrichsen, X., Yuan, X., Bao, C., Lin, X., Liu, Y., Chan, T.k., et al.: Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425 (2024)

work page arXiv 2024
[46]

Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation (2019)

2019
[47]

In: Conference on Robot Learning

Walke, H.R., Black, K., Zhao, T.Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A.W., Myers, V., Kim, M.J., Du, M., et al.: Bridgedata v2: A dataset for robot learning at scale. In: Conference on Robot Learning. pp. 1723–
[48]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

In: Interna- tional Conference on 3D Vision (3DV) (2025)

Wang, H., Agapito, L.: 3d reconstruction with spatial memory. In: Interna- tional Conference on 3D Vision (3DV) (2025)

2025
[50]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

2025
[51]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3d perception model with persistent state. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10510–10522 (2025)

2025
[52]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geo- metric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024)

2024
[53]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.:π3: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

IEEE transactions on image processing13(4), 600–612 (2004)

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality as- sessment: from error visibility to structural similarity. IEEE transactions on image processing13(4), 600–612 (2004)

2004
[55]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17868–17879 (2024)

2024
[56]

In: Robotics: Science and Systems (RSS) (2024)

Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y., Abbeel, P.: Any-point trajectory modeling for policy learning. In: Robotics: Science and Systems (RSS) (2024)

2024
[57]

Qwen-Image Technical Report

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) GEM-4D 19

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Wu, H., Yu, J., Zou, Y., Liu, X.: Multiworld: Scalable multi-agent multi- view video world models. arXiv preprint arXiv:2604.18564 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[60]

Advances in Neural Information Processing Systems37, 68082–68119 (2024)

Wu, J., Yin, S., Feng, N., He, X., Li, D., Hao, J., Long, M.: ivideogpt: Inter- active videogpts are scalable world models. Advances in Neural Information Processing Systems37, 68082–68119 (2024)

2024
[61]

arXiv preprint arXiv:2412.19584 (2024)

Xu, K., Tse, T.H.E., Peng, J., Yao, A.: Das3r: Dynamics-aware gaussian splatting for static scene reconstruction. arXiv preprint arXiv:2412.19584 (2024)

work page arXiv 2024
[62]

arXiv preprint arXiv:2501.13928 (2025)

Yang, J., Sax, A., Liang, K.J., Henaff, M., Tang, H., Cao, A., Chai, J., Meier, F., Feiszli, M.: Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928 (2025)

work page arXiv 2025
[63]

Learning Interactive Real-World Simulators

Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

In: International Conference on Learning Representations (ICLR) (2025)

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Repre- sentation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (ICLR) (2025)

2025
[66]

RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation

Zhang, C., Wu, Z., Lu, G., Tang, Y., Wang, Z.: imowm: Taming inter- active multi-modal world model for robotic manipulation. arXiv preprint arXiv:2510.09036 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

In: ICLR (2025)

Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: Monst3r: A simple approach for estimating geometry in the presence of motion. In: ICLR (2025)

2025
[68]

Videorepa: Learning physics for video generation through relational alignment with foundation models.arXiv preprint arXiv:2505.23656, 2025

Zhang, X., Liao, J., Zhang, S., Meng, F., Wan, X., Yan, J., Cheng, Y.: Vide- orepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656 (2025)

work page arXiv 2025
[69]

Tesseract: Learning 4d embodied world models, 2025.https://arxiv.org/abs/2504.20995

Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesseract: learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025)

work page arXiv 2025
[70]

3DFlowAction: Learning cross-embodiment manipulation from 3d flow world model.arXiv preprint arXiv:2506.06199, 2025

Zhi, H., Chen, P., Zhou, S., Dong, Y., Wu, Q., Han, L., Tan, M.: 3dflowac- tion: Learning cross-embodiment manipulation from 3d flow world model. arXiv preprint arXiv:2506.06199 (2025)

work page arXiv 2025
[71]

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

Zhou, K., Wang, Y., Chen, G., Chang, X., Beaudouin, G., Zhan, F., Liang, P.P., Wang, M.: Page-4d: Disentangled pose and geometry estimation for 4d perception. arXiv preprint arXiv:2510.17568 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Advances in Neural Information Processing Systems36, 69532–69545 (2023)

Zhou, K., Zhong, J.X., Shin, S., Lu, K., Yang, Y., Markham, A., Trigoni, N.: Dynpoint: Dynamic neural point for view synthesis. Advances in Neural Information Processing Systems36, 69532–69545 (2023)

2023
[73]

RoboDreamer: Learning Compositional World Models for Robot Imagination

Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024) 20 K. Zhou et al

work page internal anchor Pith review Pith/arXiv arXiv 2024
[74]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Zhu, H., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Chen, J., Shen, C., Pang, J., He, T.: Aether: Geometric-aware unified world modeling. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8535–8546 (2025)

2025

[1] [1]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., Kirmani, S.: Gen2act: Human video gen- eration in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

In: European Conference on Computer Vision (ECCV) (2024)

Bharadhwaj, H., Mottaghi, R., Gupta, A., Tulsiani, S.: Track2act: Predict- ing point tracks from internet videos enables generalizable robot manipula- tion. In: European Conference on Computer Vision (ECCV) (2024)

2024

[3] [3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Björck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: pi0: A vision-language-action flowmodelforgeneralrobotcontrol.arXivpreprintarXiv:2410.24164(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.:Stable video dif- fusion:Scalinglatentvideodiffusionmodelstolargedatasets.arXivpreprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K.,Herzog, A., Hsu, J., et al.:Rt-1:Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

In: Forty-first International Conference on Machine Learning (2024)

Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Gener- ative interactive environments. In: Forty-first International Conference on Machine Learning (2024)

2024

[8] [8]

AdvancesinNeuralInformationProcessingSystems37,24081–24125(2024)

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. AdvancesinNeuralInformationProcessingSystems37,24081–24125(2024)

2024

[9] [9]

Large Video Planner Enables Generalizable Robot Control

Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables gen- eralizable robot control. arXiv preprint arXiv:2512.15840 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Chen, X., Chen, Y., Xiu, Y., Geiger, A., Chen, A.: Easi3r: Estimating disentangled motion from dust3r without training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9158–9168 (2025)

2025

[11] [11]

Mind: Unified visual imagination and control via hierarchical world models.arXiv preprint arXiv:2506.18897, 2025

Chi, X., Ge, K., Liu, J., Zhou, S., Jia, P., He, Z., Liu, Y., Li, T., Han, L., Han, S., et al.: Mind: Learning a dual-system world model for real-time planning and implicit risk analysis. arXiv preprint arXiv:2506.18897 (2025)

work page arXiv 2025

[12] [12]

arXiv preprint arXiv:2410.15461 (2024) 16 K

Chi, X., Zhang, H., Fan, C.K., Qi, X., Zhang, R., Chen, A., Chan, C.m., Xue, W., Luo, W., Zhang, S., et al.: Eva: An embodied world model for future video anticipation. arXiv preprint arXiv:2410.15461 (2024) 16 K. Zhou et al

work page arXiv 2024

[13] [13]

Doersch, C., Gupta, A., Markeeva, L., Recasens, A., Smaira, L., Aytar, Y., Carreira, J., Zisserman, A., Yang, Y.: TAP-Vid: A benchmark for tracking anypointinavideo.In:AdvancesinNeuralInformationProcessingSystems (NeurIPS), Datasets and Benchmarks Track (2022)

2022

[14] [14]

Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson

Du, Y., Yang, M., Florence, P., Xia, F., Wahid, A., Ichter, B., Sermanet, P., Yu, T., Abbeel, P., Tenenbaum, J.B., et al.: Video language planning. arXiv preprint arXiv:2310.10625 (2023)

work page arXiv 2023

[15] [15]

Advances in neural information processing systems36, 9156–9172 (2023)

Du,Y.,Yang,S.,Dai,B.,Dai,H.,Nachum,O.,Tenenbaum,J.,Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in neural information processing systems36, 9156–9172 (2023)

2023

[16] [16]

In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

2014

[17] [17]

In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3d ob- ject reconstruction from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 605–613 (2017)

2017

[18] [18]

Vidarc: Embodied video diffusion model for closed-loop control.arXiv preprint arXiv:2512.17661, 2025

Feng, Y., Xiang, C., Mao, X., Tan, H., Zhang, Z., Huang, S., Zheng, K., Liu, H., Su, H., Zhu, J.: Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661 (2025)

work page arXiv 2025

[19] [19]

Learning video generation for robotic manipulation with collaborative trajectory control,

Fu, X., Wang, X., Liu, X., Bai, J., Xu, R., Wan, P., Zhang, D., Lin, D.: Learning video generation for robotic manipulation with collaborative tra- jectory control. arXiv preprint arXiv:2506.01943 (2025)

work page arXiv 2025

[20] [20]

Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Guo, Y., Shi, L.X., Chen, J., Finn, C.: Ctrl-world: A controllable genera- tive world model for robot manipulation. arXiv preprint arXiv:2510.10125 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

arXiv preprint arXiv:2511.19971 (2025)

Hu, Y., Cheng, C., Yu, S., Guo, X., Wang, H.: Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction. arXiv preprint arXiv:2511.19971 (2025)

work page arXiv 2025

[22] [22]

Skill-aware diffusion for generalizable robotic manipulation.arXiv preprint arXiv:2601.11266, 2026

Huang, A., Chen, J., Cheng, J., Song, R., Pan, W., Zhang, W.: Skill- aware diffusion for generalizable robotic manipulation. arXiv preprint arXiv:2601.11266 (2026)

work page arXiv 2026

[23] [23]

Huang, Y.-W

Huang, W., Chao, Y.W., Mousavian, A., Liu, M.Y., Fox, D., Mo, K., Fei-Fei, L.: Pointworld: Scaling 3d world models for in-the-wild robotic manipula- tion. arXiv preprint arXiv:2601.03782 (2026)

work page arXiv 2026

[24] [24]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridg- ing the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learn- ing benchmark & learning environment. IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

2020

[26] [26]

Karaev, N., Makarov, Y., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.In:ProceedingsoftheIEEE/CVFInternationalConferenceonCom- puter Vision. pp. 6013–6022 (2025)

2025

[27] [27]

In: European conference on computer vision

Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: European conference on computer vision. pp. 18–35. Springer (2024) GEM-4D 17

2024

[28] [28]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karam- cheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013

[31] [31]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

2023

[32] [32]

In: European Conference on Computer Vision

Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with mast3r. In: European Conference on Computer Vision. pp. 71–91. Springer (2024)

2024

[33] [33]

arXiv preprint arXiv:????.???? (2025)

Li, Z., Wu, P., Han, X., Cai, R., Du, Y.: 4d latent world model for robot planning. arXiv preprint arXiv:????.???? (2025)

2025

[34] [34]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[36] [36]

Robotransfer: Geometry-consistent video diffusion for robotic visual policy transfer.arXiv preprint arXiv:2505.23171, 2025

Liu, L., Wang, X., Zhao, G., Li, K., Qin, W., Zhu, J., Qiu, J., Zhu, Z., Huang, G., Su, Z.: Robotransfer: Controllable geometry-consistent video diffusion for manipulation policy transfer. arXiv preprint arXiv:2505.23171 (2025)

work page arXiv 2025

[37] [37]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al.: Sora: A review on background, technol- ogy, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Geometry-aware 4D Video Generation for Robot Manipulation

Liu, Z., Li, S., Cousineau, E., Feng, S., Burchfiel, B., Song, S.: Geometry-aware 4d video generation for robot manipulation. arXiv preprint arXiv:2507.01099 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Graspgen: A diffusion-based framework for 6-dof grasping with on- generator training,

Murali,A.,Sundaralingam,B.,Chao,Y.W.,Yamada,J.,Yuan,W.,Carlson, M., Ramos, F., Birchfield, S., Fox, D., Eppner, C.: GraspGen: A diffusion- based framework for 6-DOF grasping with on-generator training. arXiv preprint arXiv:2507.13097 (2025)

work page arXiv 2025

[40] [40]

In: Pro- ceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Pro- ceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

2023

[41] [41]

Shang, Y ., Zhang, X., Tang, Y ., Jin, L., Gao, C., Wu, W., and Li, Y

Qi, H., Yin, H., Du, Y., Yang, H.: Strengthening generative robot policies through predictive world modeling. arXiv preprint arXiv:2502.00622 (2025)

work page arXiv 2025

[42] [42]

arXiv preprint arXiv:2510.07313 (2025) 2, 4

Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manip- ulation. arXiv preprint arXiv:2510.07313 (2025) 18 K. Zhou et al

work page arXiv 2025

[43] [43]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

In: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Tech- niques (SIGGRAPH)

Shoemake, K.: Animating rotation with quaternion curves. In: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Tech- niques (SIGGRAPH). pp. 245–254. ACM (1985)

1985

[45] [45]

Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,

Tao, S., Xiang, F., Shukla, A., Qin, Y., Hinrichsen, X., Yuan, X., Bao, C., Lin, X., Liu, Y., Chan, T.k., et al.: Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425 (2024)

work page arXiv 2024

[46] [46]

Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation (2019)

2019

[47] [47]

In: Conference on Robot Learning

Walke, H.R., Black, K., Zhao, T.Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A.W., Myers, V., Kim, M.J., Du, M., et al.: Bridgedata v2: A dataset for robot learning at scale. In: Conference on Robot Learning. pp. 1723–

[48] [48]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

In: Interna- tional Conference on 3D Vision (3DV) (2025)

Wang, H., Agapito, L.: 3d reconstruction with spatial memory. In: Interna- tional Conference on 3D Vision (3DV) (2025)

2025

[50] [50]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

2025

[51] [51]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3d perception model with persistent state. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10510–10522 (2025)

2025

[52] [52]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geo- metric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024)

2024

[53] [53]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.:π3: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

IEEE transactions on image processing13(4), 600–612 (2004)

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality as- sessment: from error visibility to structural similarity. IEEE transactions on image processing13(4), 600–612 (2004)

2004

[55] [55]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17868–17879 (2024)

2024

[56] [56]

In: Robotics: Science and Systems (RSS) (2024)

Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y., Abbeel, P.: Any-point trajectory modeling for policy learning. In: Robotics: Science and Systems (RSS) (2024)

2024

[57] [57]

Qwen-Image Technical Report

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) GEM-4D 19

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Wu, H., Yu, J., Zou, Y., Liu, X.: Multiworld: Scalable multi-agent multi- view video world models. arXiv preprint arXiv:2604.18564 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[60] [60]

Advances in Neural Information Processing Systems37, 68082–68119 (2024)

Wu, J., Yin, S., Feng, N., He, X., Li, D., Hao, J., Long, M.: ivideogpt: Inter- active videogpts are scalable world models. Advances in Neural Information Processing Systems37, 68082–68119 (2024)

2024

[61] [61]

arXiv preprint arXiv:2412.19584 (2024)

Xu, K., Tse, T.H.E., Peng, J., Yao, A.: Das3r: Dynamics-aware gaussian splatting for static scene reconstruction. arXiv preprint arXiv:2412.19584 (2024)

work page arXiv 2024

[62] [62]

arXiv preprint arXiv:2501.13928 (2025)

Yang, J., Sax, A., Liang, K.J., Henaff, M., Tang, H., Cao, A., Chai, J., Meier, F., Feiszli, M.: Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928 (2025)

work page arXiv 2025

[63] [63]

Learning Interactive Real-World Simulators

Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [64]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

In: International Conference on Learning Representations (ICLR) (2025)

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Repre- sentation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (ICLR) (2025)

2025

[66] [66]

RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation

Zhang, C., Wu, Z., Lu, G., Tang, Y., Wang, Z.: imowm: Taming inter- active multi-modal world model for robotic manipulation. arXiv preprint arXiv:2510.09036 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[67] [67]

In: ICLR (2025)

Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: Monst3r: A simple approach for estimating geometry in the presence of motion. In: ICLR (2025)

2025

[68] [68]

Videorepa: Learning physics for video generation through relational alignment with foundation models.arXiv preprint arXiv:2505.23656, 2025

Zhang, X., Liao, J., Zhang, S., Meng, F., Wan, X., Yan, J., Cheng, Y.: Vide- orepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656 (2025)

work page arXiv 2025

[69] [69]

Tesseract: Learning 4d embodied world models, 2025.https://arxiv.org/abs/2504.20995

Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesseract: learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025)

work page arXiv 2025

[70] [70]

3DFlowAction: Learning cross-embodiment manipulation from 3d flow world model.arXiv preprint arXiv:2506.06199, 2025

Zhi, H., Chen, P., Zhou, S., Dong, Y., Wu, Q., Han, L., Tan, M.: 3dflowac- tion: Learning cross-embodiment manipulation from 3d flow world model. arXiv preprint arXiv:2506.06199 (2025)

work page arXiv 2025

[71] [71]

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

Zhou, K., Wang, Y., Chen, G., Chang, X., Beaudouin, G., Zhan, F., Liang, P.P., Wang, M.: Page-4d: Disentangled pose and geometry estimation for 4d perception. arXiv preprint arXiv:2510.17568 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[72] [72]

Advances in Neural Information Processing Systems36, 69532–69545 (2023)

Zhou, K., Zhong, J.X., Shin, S., Lu, K., Yang, Y., Markham, A., Trigoni, N.: Dynpoint: Dynamic neural point for view synthesis. Advances in Neural Information Processing Systems36, 69532–69545 (2023)

2023

[73] [73]

RoboDreamer: Learning Compositional World Models for Robot Imagination

Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024) 20 K. Zhou et al

work page internal anchor Pith review Pith/arXiv arXiv 2024

[74] [74]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Zhu, H., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Chen, J., Shen, C., Pang, J., He, T.: Aether: Geometric-aware unified world modeling. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8535–8546 (2025)

2025