pith. machine review for the scientific record.

arxiv: 2604.19741 · v1 · submitted 2026-04-21 · 💻 cs.CV


CityRAG: Stepping Into a City via Spatially-Grounded Video Generation


Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video generation · 3D consistency · geo-registered data · city simulation · spatially grounded · loop closure · video diffusion · autonomous driving simulation

The pith

CityRAG generates coherent minutes-long videos of real cities grounded in geo-registered data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CityRAG to solve the challenge of creating video simulations that match real city locations in 3D space. It does this by feeding the model many location-specific but not time-matched video clips, allowing it to learn the permanent features of the scene separately from temporary ones like cars or rain. A reader would care because this enables creating useful, realistic virtual environments for testing self-driving cars or robots without collecting perfectly matched data for every scenario. The results show it can keep videos consistent for minutes while following real paths and even looping back correctly.

Core claim

CityRAG is a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. It relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Experiments demonstrate that it can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

What carries the argument

CityRAG, a video generative model conditioned on geo-registered data to enforce spatial grounding and 3D consistency across long sequences.

Load-bearing premise

That large corpora of geo-registered but temporally unaligned data are sufficient to teach the model to semantically disentangle the underlying scene from transient attributes while still producing 3D-consistent navigation over long sequences.

What would settle it

A test video sequence following a known real-world trajectory that shows drift in 3D geometry or failure to close loops despite the geo-conditioning.
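One way such a test could be scored: estimate camera poses from the generated frames (e.g., with structure-from-motion) and compare the pose at the trajectory's start against the pose recovered at the claimed loop closure. A minimal sketch, assuming 4x4 camera-to-world pose matrices are available; the pass/fail thresholds one would apply are not specified by the paper:

```python
import numpy as np

def loop_closure_error(pose_start, pose_end):
    """Translation (meters) and rotation (degrees) drift between two
    4x4 camera-to-world poses; zero drift means a perfect loop closure."""
    rel = np.linalg.inv(pose_start) @ pose_end
    t_err = float(np.linalg.norm(rel[:3, 3]))
    # Rotation angle recovered from the trace of the relative rotation matrix.
    cos_a = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    r_err = float(np.degrees(np.arccos(cos_a)))
    return t_err, r_err

# A loop that returns exactly to its start has zero drift; a pose offset
# by half a meter does not.
drifted = np.eye(4)
drifted[:3, 3] = [0.5, 0.0, 0.2]
t, r = loop_closure_error(np.eye(4), drifted)
```

A generated sequence "failing" this test would mean large `t_err`/`r_err` at loop closure despite geo-conditioning, which is exactly the falsifier the section above describes.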

Figures

Figures reproduced from arXiv: 2604.19741 by Bharath Hariharan, Boyang Deng, Charles Herrmann, Gene Chou, Jason Y. Zhang, Kyle Genova, Noah Snavely, Philipp Henzler, Songyou Peng.

Figure 1: CityRAG generates minutes-long, spatially grounded video sequences that 1) render real buildings, traffic lights, and roads of a city; 2) follow a user-defined path and perform loop closure after generating a thousand frames; 3) are initialized from a first image and respect its weather conditions and dynamic objects. Top: Panthéon and surrounding buildings, Paris. Middle: Calle Quiñones St, San Juan. Bottom: …
Figure 2: Training data pipeline. We use Street View data in the form of panoramas. We create a training pair when two sets of captures at different times (e.g., morning vs. afternoon) lie along a continuous path with an average distance < 5 meters, so the model learns to disentangle static and transient attributes, e.g., roads and buildings (green box) vs. weather and cars (red box).
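The pairing criterion in Figure 2's caption can be sketched as a filter over capture positions. A minimal sketch, assuming planar positions in meters and a nearest-neighbor reading of "average distance" (only the < 5 m threshold comes from the caption; the exact matching rule is an assumption):

```python
import numpy as np

def is_training_pair(positions_a, positions_b, max_avg_dist=5.0):
    """Accept two capture sequences as a training pair when the average
    nearest-neighbor distance between their positions is below the
    threshold. The 5 m threshold is from the paper's caption; the
    nearest-neighbor matching rule here is an assumption."""
    a = np.asarray(positions_a, dtype=float)  # (N, 2) positions in meters
    b = np.asarray(positions_b, dtype=float)  # (M, 2)
    # Pairwise distances, then each capture in A matched to its closest in B.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean()) < max_avg_dist

# Same street captured twice, ~2 m apart: accepted as a pair.
morning = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
afternoon = [(0.0, 2.0), (10.0, 2.0), (21.0, 2.0)]
```

Pairs that pass this filter show the same static geometry under different transient conditions, which is what lets the model separate the two.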
Figure 3: Architecture of video generation in CityRAG. The generator takes three conditions: the first image, a trajectory, and geo-registered data in the form of video frames (denoted geospatial conditioning) along the trajectory. We finetune from a state-of-the-art image-to-video (I2V) generative model, Wan 2.1 (14B). It consists of a spatio-temporal VAE and a DiT-based diffusion model. The condition inputs to th…
Figure 4: RAG pipeline at inference time. The user first selects a location and image that they want to step into. Then, with a user-specified trajectory, we use the Street View database to retrieve our geospatial conditioning. All conditions are passed to the video model, which generates the output the user sees. We then automatically update the first frame and location and repeat the process. Same location, with a 90…
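The autoregressive loop described in Figure 4 can be sketched as follows. Here `database.retrieve` and `generate_clip` are hypothetical stand-ins for the Street View retrieval step and the video model, and the segment length is illustrative:

```python
def step_into_city(first_image, location, trajectory, database, generate_clip,
                   segment_len=81):
    """Autoregressive RAG loop sketched from Figure 4: for each trajectory
    segment, retrieve geospatial conditioning, generate a clip, then reuse
    the clip's last frame and position as the next iteration's state.
    All callables are hypothetical stand-ins for the paper's components."""
    frames = []
    image = first_image
    for start in range(0, len(trajectory), segment_len):
        segment = trajectory[start:start + segment_len]
        # RAG step: fetch Street View frames along this segment.
        conditioning = database.retrieve(location, segment)
        # Video model conditioned on first image, trajectory, and retrieval.
        clip = generate_clip(image, segment, conditioning)
        frames.extend(clip)
        # Update the first frame and location for the next iteration.
        image = clip[-1]
        location = segment[-1]
    return frames
```

The design point carried by the caption is that only the last generated frame and the current location propagate between iterations; spatial grounding for each new segment comes from retrieval, not from the growing frame history.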
Figure 5: Making geospatial retrieval work for arbitrary trajectories.
Figure 6: Qualitative comparisons. We show three challenging test samples. Input conditions include the video for geospatial conditioning (leftmost column), the first image of the ground truth video (rightmost column), and the trajectory defined by the ground truth video. Scene A: Our method consistently follows the weather and the black car in front. Scene B: Our method reconstructs buildings (t=7s) that appear lat…
Figure 7: User study results. Users were asked to rate each video on a scale of 1 (lowest) to 3 (highest). Larger size indicates higher visual quality. Only our method generates a video that is both a smooth continuation from the first image and a faithful render of the real physical location. The questions ask users to evaluate the generated videos' visual quality, whether they are smooth continuations of the f…
Figure 8: Flexibility of trajectory conditioning. Our trajectory conditioning does not have to be precisely aligned with the geospatial conditioning. Left: Even though there is a translation mismatch between the geospatial conditioning (car stuck in trajectory) and the trajectory's left turn, our model correctly generates a plausible sequence following the trajectory despite the mismatch. Right: Our model is capable of rotating 360 d…
Figure 9: User study Q1. 3: Visually similar, and most people would agree that they belong to the same location; there can be noticeable distortions or artifacts. 2: There are some similarities, but might not be the same location; may contain distortions or artifacts. 1: Distinctly different; likely two completely different locations.
Figure 10: User study Q2.
Figure 11: User study Q3.
Original abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CityRAG, a video generative model that uses large corpora of geo-registered but temporally unaligned data as context to produce 3D-consistent, navigable video sequences simulating real-world locations. It claims to achieve semantic disentanglement of scene structure from transient attributes, enabling minutes-long coherent generation, maintenance of weather/lighting over thousands of frames, loop closure, and navigation along complex trajectories to reconstruct real geography, with applications to autonomous driving and robotics simulation.

Significance. If the central claims hold with rigorous validation, CityRAG would represent a meaningful advance in grounded video generation by demonstrating that unaligned geo-registered corpora can support long-horizon physical consistency without explicit 3D supervision or temporal alignment. This could impact simulation pipelines where real-world spatial grounding is required.

major comments (2)
  1. [Abstract] The central experimental claims (coherent minutes-long sequences, loop closure, real-world geography reconstruction, and maintenance of conditions over thousands of frames) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes it impossible to assess whether the data support the claims of physical grounding and disentanglement.
  2. [Abstract] The weakest assumption—that temporally unaligned geo-registered data alone suffices for semantic disentanglement while preserving 3D consistency over long horizons—is presented without a concrete test or counter-example analysis. If this does not hold, the loop-closure and trajectory-navigation results would not follow.
minor comments (1)
  1. [Abstract] The abstract uses terms such as 'physically grounded' and 'real-world geography' without defining the precise criteria or evaluation protocol used to verify them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for major revision. The comments highlight important aspects of validation for our claims regarding long-horizon consistency and disentanglement. We address each point below and commit to revisions that strengthen the manuscript's rigor without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract] The central experimental claims (coherent minutes-long sequences, loop closure, real-world geography reconstruction, and maintenance of conditions over thousands of frames) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes it impossible to assess whether the data support the claims of physical grounding and disentanglement.

    Authors: We agree that the abstract presents these outcomes at a summary level without embedded numerical results or direct comparisons. The full manuscript supports the claims through extensive qualitative demonstrations of minute-long coherent sequences, loop closure under navigation, and preservation of scene structure across varying conditions. To directly address the concern, we will revise the abstract to include a concise summary of supporting evidence (e.g., frame counts and observed consistency indicators) and expand the experiments section with quantitative metrics where feasible for generative video tasks, baseline comparisons against ungrounded models, and ablation studies on the role of geo-registration. This will enable clearer assessment of the physical grounding and disentanglement. revision: yes

  2. Referee: [Abstract] The weakest assumption—that temporally unaligned geo-registered data alone suffices for semantic disentanglement while preserving 3D consistency over long horizons—is presented without a concrete test or counter-example analysis. If this does not hold, the loop-closure and trajectory-navigation results would not follow.

    Authors: We recognize that the abstract does not isolate an explicit test or counter-example for the disentanglement assumption. The manuscript's design relies on training with temporally unaligned geo-registered corpora to promote separation of fixed scene elements from transients, which is reflected in the generated outputs maintaining geometry while varying weather and dynamics. To strengthen the presentation, we will add a dedicated analysis subsection detailing how the unaligned training enables this (with examples of consistent reconstruction under changed conditions) and include discussion of edge cases or potential limitations where disentanglement may be stressed, such as in highly cluttered urban scenes. This will clarify the link to the reported loop-closure and navigation results. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claims rest on experimental demonstrations of video generation using geo-registered but temporally unaligned corpora to achieve semantic disentanglement and 3D-consistent navigation. No load-bearing derivation, equation, or prediction is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The approach is described as leveraging external data properties for grounding, with results validated through generation tasks rather than tautological renaming or imported uniqueness theorems. This is a self-contained empirical modeling paper without detectable circular steps in the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate any free parameters, axioms, or invented entities; the central claim rests on the existence and utility of geo-registered corpora and the effectiveness of temporally unaligned training for disentanglement, neither of which is formalized here.

pith-pipeline@v0.9.0 · 5501 in / 1193 out tokens · 44808 ms · 2026-05-10T02:05:31.073207+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 35 canonical work pages · 17 internal anchors

  1. [1]

    In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)

    Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building rome in a day. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 72–79 (2009).https://doi.org/10.1109/ICCV.2009.5459148

  2. [2]

    In: CVPR (2025)

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In: CVPR (2025)

  3. [3]

    Lumiere: A space-time diffusion model for video generation

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Li, Y., Michaeli, T., Wang, O., Sun, D., Dekel, T., Mosseri, I.: Lumiere: A space-time diffusion model for video generation. In: Proceedings of the 41st International Conference on Machine Learning (2024),https://arxiv.org/abs/ 2401.12945

  4. [4]

    arXiv preprint arXiv:2312.03079 (2023).https://doi.org/ 10.48550/arxiv.2312.03079

    Bhat, S.F., Mitra, N.J., Wonka, P.: Loosecontrol: Lifting controlnet for generalized depth conditioning. arXiv preprint arXiv:2312.03079 (2023).https://doi.org/ 10.48550/arxiv.2312.03079

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  6. [6]

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators

  7. [7]

    In: ICLR (2026)

    Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., Agrawala, M., Jiang, L., Wetzstein, G.: Mixture of contexts for long video generation. In: ICLR (2026)

  8. [8]

    Advances in Neural Information Processing Systems37, 24081–24125 (2025) 16 G

    Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2025) 16 G. Chou et al

  9. [9]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Chen, T.S., Siarohin, A., Menapace, W., Fang, Y., Lee, K.S., Skorokhodov, I., Aberman, K., Zhu, J.Y., Yang, M.H., Tulyakov, S.: Video alchemist: Multi-subject open-set personalization in video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  10. [10]

    arXiv preprint arXiv:2602.06159 (2026)

    Chen, X., Zhang, C., Fu, C., Yang, Z., Zhou, K., Zhang, Y., He, J., Zhang, Y., Sun, M., Wang, Z., Dong, Z., Long, X., Meng, L.: Driving with dino: Vision foundation features as a unified bridge for sim-to-real generation in autonomous driving. arXiv preprint arXiv:2602.06159 (2026)

  11. [11]

    In: CVPR (2022)

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)

  12. [12]

    Computer Vision and Image Un- derstanding259, 104445 (2025).https://doi.org/10.1016/j.cviu.2025.104445

    Chigot, E., Wilson, D.G., Ghrib, M., Oberlin, T.: Style transfer with diffusion models for synthetic-to-real domain adaptation. Computer Vision and Image Un- derstanding259, 104445 (2025).https://doi.org/10.1016/j.cviu.2025.104445

  13. [13]

    KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos

    Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3d-consistent videos from unposed internet photos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025),https://arxiv.org/abs/2411.13549

  14. [14]

    googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

    DeepMind, G.: Veo: a text-to-video generation system (2025),https://storage. googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

  15. [15]

    In: ACM SIGGRAPH 2024 Conference Papers

    Deng, B., Tucker, R., Li, Z., Guibas, L., Snavely, N., Wetzstein, G.: Streetscapes: Large-scale consistent street view generation using autoregressive video diffusion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  16. [16]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)

    Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)

  17. [17]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)

  18. [18]

    arXiv preprint arXiv:2601.05239 (2025)

    Fu, X., Tang, S., Shi, M., Liu, X., Gu, J., Liu, M.Y., Lin, D., Lin, C.H.: Plenoptic video generation. arXiv preprint arXiv:2601.05239 (2025)

  19. [19]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.C., Dong, Y., Mo, K., Lin, C.H., Ma, Q., Nah, S., Magne, L., Xiang, J., Xie, Y., Zheng, R., Niu, D., Tan, Y.L., Zentner, K., Kurian, G., Indupuru, S., Jannaty, P., Gu, J., Zhang, J., Malik, J., Abbeel, P., Liu, M.Y., Zhu, Y., Jang, J., Fan, L.J.: Dreamdojo: A generalist robot world model f...

  20. [20]

    In: International Conference on Learning Rep- resentations (2024)

    Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion fea- tures for consistent video editing. In: International Conference on Learning Rep- resentations (2024)

  21. [21]

    Emu video: Factorizing text-to-video gen- eration by explicit image conditioning,

    Girdhar, R., Singh, M., Brown, A., et al.: Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023), https://arxiv.org/abs/2311.10709

  22. [22]

    arXiv preprint arXiv:2501.03847 (2025)

    Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., Wang, W., Liu, Y.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. arXiv preprint arXiv:2501.03847 (2025)

  23. [23]

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning (2023) CityRAG 17

  24. [24]

    Advances in neural information processing systems30(2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  25. [25]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022),https://arxiv. org/abs/2210.02303

  26. [26]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  27. [27]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  28. [28]

    Jin, Y., Sun, Z., Li, N., Xu, K., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., Lin, Z.: Pyramidal flow matching for efficient video generative modeling (2024)

  29. [29]

    ACM Transactions on Graphics42(4) (2023)

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics42(4) (2023)

  30. [30]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Interna- tional Conference on Learning Representations (ICLR) (2015),https://arxiv. org/abs/1412.6980

  31. [31]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  32. [32]

    Anyv2v: A tuning-free framework for any video-to- video editing tasks.arXiv preprint arXiv:2403.14468, 2024

    Ku, M., Wei, C., Ren, W., Yang, H., Chen, W., Liang, Y., Zheng, T., Guo, M., Zhao, X., Sang, J., Yang, M.H., Chen, W.: Anyv2v: A plug-and-play framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

  33. [33]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    gi Kwak, J., Dong, E., Jin, Y., Ko, H., Mahajan, S., Yi, K.M.: Vivid-1-to-3: Novel view synthesis with video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6775–6785 (2024)

  34. [34]

    Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

  35. [35]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33, pp. 9459–9474 (2020)

  36. [36]

    arXiv preprint arXiv:2405.15757 (2024)

    Liang,F.,Kodaira,A.,Xu,C.,Tomizuka,M.,Keutzer,K.,Marculescu,D.:Looking backward: Streaming video-to-video translation with feature banks. arXiv preprint arXiv:2405.15757 (2024)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

    Liang, F., Wu, B., Wang, J., Yu, L., Li, K., Zhao, Y., Misra, I., Huang, J.B., Zhang, P., Vajda, P., Marculescu, D.: Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  38. [38]

    Muon is Scalable for LLM Training

    Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., Chen, Y., Zheng, H., Liu, Y., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y., Wang, J., Dong, M., Zhang, Z., Kang, Y., Zhang, H., Xu, X., Zhang, Y., Wu, Y., Zhou, X., Yang, Z.: Muon is scalable for llm training (2025),https: //arxiv.org/abs/2502.16982 18 G. Chou et al

  39. [39]

    R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

    Ljungbergh, W., Taveira, B., Zheng, W., Tonderski, A., Peng, C., Kahl, F., Pe- tersson, C., Felsberg, M., Keutzer, K., Tomizuka, M., Zhan, W.: R3d2: Realistic 3d asset insertion via diffusion for autonomous driving simulation. arXiv (2025). https://doi.org/10.48550/arxiv.2506.07826

  40. [40]

    LUMA: Luma dream machine (2024),https://lumalabs.ai/dream-machine

  41. [41]

    Realrag: Retrieval-augmented realistic image generation via self-reflective contrastive learning.arXiv preprint arXiv:2502.00848, 2025

    Lyu, Y., Zheng, X., Jiang, L., Yan, Y., Zou, X., Zhou, H., Zhang, L., Hu, X.: Re- alrag: Retrieval-augmented realistic image generation via self-reflective contrastive learning. In: Proceedings of the 42nd International Conference on Machine Learn- ing (ICML) (2025),https://arxiv.org/abs/2502.00848

  42. [42]

    Movie Gen: A Cast of Media Foundation Models

    team at Meta, T.M.G.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024),https://arxiv.org/abs/2410.13720

  43. [43]

    ICCV (2021)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. ICCV (2021)

  44. [44]

    com/krea-ai/realtime-video

    Millon,E.:Krearealtime14b:Real-timevideogeneration(2025),https://github. com/krea-ai/realtime-video

  45. [45]

    Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195– 4205 (2023)

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Ren,X.,Shen,T.,Huang,J.,Ling,H.,Lu,Y.,Nimier-David,M.,Müller,T.,Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  47. [47]

    In: International Conference on Learning Representations (2022),https: //openreview.net/forum?id=mFppY38Z36C

    Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion mod- els. In: International Conference on Learning Representations (2022),https: //openreview.net/forum?id=mFppY38Z36C

  48. [48]

    In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

    Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  49. [49]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Seed, B.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025),https://arxiv.org/abs/2506.09113

  50. [50]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Singer, U., Polyak, A., Hayes, T., Shaham, D., Saharia, C., Chan, W., Norouzi, M.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022),https://arxiv.org/abs/2209.14792

  51. [51]

    Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (2006),https: //api.semanticscholar.org/CorpusID:13385757

    Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3d. Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (2006),https: //api.semanticscholar.org/CorpusID:13385757

  52. [52]

    Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History- guided video diffusion (2025),https://arxiv.org/abs/2502.06764

  53. [53]

    In: CVPR (2022)

    Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Bar- ron, J.T., Kretzschmar, H.: Block-nerf: Scalable large scene neural view synthesis. In: CVPR (2022)

  54. [54]

    In: ECCV (2024)

    Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G., Hariharan, B., Snavely, N.: Megascenes: Scene-level view synthesis at scale. In: ECCV (2024)

  55. [55]

    In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)

    Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 12922–12931 (2022)

  56. [56]

    In: European Conference on Computer Vision (ECCV) (2024) CityRAG 19

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision (ECCV) (2024) CityRAG 19

  57. [57]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  58. [58]

    arXiv preprint arXiv:2401.09962 (2024),https://arxiv.org/abs/2401.09962

    Wang, Z., Li, A., Zhu, L., Guo, Y., Dou, Q., Li, Z.: Customvideo: Customizing text-to-video generation with multiple subjects. arXiv preprint arXiv:2401.09962 (2024),https://arxiv.org/abs/2401.09962

  59. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

  60. Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers (2024)

  61. Waymo: The waymo world model: A new frontier for autonomous driving simulation. Waymo Blog (February 2026), https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation

  62. Wei, Y., Zhang, S., Qing, Z., Yuan, H., Liu, Z., Liu, Y., Zhang, Y., Zhou, J., Shan, H.: Dreamvideo: Composing your dream videos with customized subject and motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6537–6549 (2024)

  63. Wu, B., Chuang, C.Y., Wang, X., Jia, Y., Krishnakumar, K., Xiao, T., Liang, F., Yu, L., Vajda, P.: Fairy: Fast parallelized instruction-guided video-to-video synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  64. Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video world models with long-term spatial memory (2025), https://arxiv.org/abs/2506.05284

  65. Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: Worldmem: Long-term consistent world simulation with memory (2025), https://arxiv.org/abs/2504.12369

  66. Yang, Z., Guo, X., Ding, C., Wang, C., Wu, W., Zhang, Y.: Instadrive: Instance-aware driving world models for realistic and consistent video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 25410–25420 (2025)

  67. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  68. Yu, M., Hu, W., Xing, J., Shan, Y.: Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 100–111 (October 2025)

  69. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

  70. Zhang, L., Cai, S., Li, M., Wetzstein, G., Agrawala, M.: Frame context packing and drift prevention in next-frame-prediction video diffusion models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  71. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023)

  72. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  73. Zhao, H., Weng, H., Lu, D., Li, A., Li, J., Panda, A., Xie, S.: On scaling up 3d gaussian splatting training (2024), https://arxiv.org/abs/2406.18533

  74. Zhou, J.J., Gao, H., Voleti, V., Vasishta, A., Yao, C.H., Boss, M., Torr, P., Rupprecht, C., Jampani, V.: Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint (2025)

  75. Zhou, Y., Simon, M., Peng, Z., Mo, S., Zhu, H., Guo, M., Zhou, B.: Simgen: Simulator-conditioned driving scene generation. arXiv preprint arXiv:2406.09386 (2024)

  76. Zhu, C., Wu, Y., Wang, S., Wu, G., Wang, L.: Motionrag: Motion retrieval-augmented image-to-video generation. In: Proceedings of the 39th International Conference on Neural Information Processing Systems (2025)

A Appendix A: Details of User Study

As mentioned in the main text, we conduct a user study to evaluate the capabilities and limitations of each...