pith. machine review for the scientific record.

arxiv: 2604.19741 · v1 · submitted 2026-04-21 · 💻 cs.CV


CityRAG: Stepping Into a City via Spatially-Grounded Video Generation


Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video generation · 3D consistency · geo-registered data · city simulation · spatially grounded · loop closure · video diffusion · autonomous driving simulation

The pith

CityRAG generates coherent minutes-long videos of real cities grounded in geo-registered data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CityRAG to solve the challenge of creating video simulations that match real city locations in 3D space. It does this by feeding the model many location-specific but not time-matched video clips, allowing it to learn the permanent features of the scene separately from temporary ones like cars or rain. A reader would care because this enables creating useful, realistic virtual environments for testing self-driving cars or robots without collecting perfectly matched data for every scenario. The results show it can keep videos consistent for minutes while following real paths and even looping back correctly.

Core claim

CityRAG is a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. It relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Experiments demonstrate that it can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

What carries the argument

CityRAG, a video generative model conditioned on geo-registered data to enforce spatial grounding and 3D consistency across long sequences.

Load-bearing premise

That large corpora of geo-registered but temporally unaligned data are sufficient to teach the model to semantically disentangle the underlying scene from transient attributes while still producing 3D-consistent navigation over long sequences.

What would settle it

A test video sequence following a known real-world trajectory that shows drift in 3D geometry or failure to close loops despite the geo-conditioning.
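One way such a test could be scored: estimate camera poses from the generated frames (e.g., with structure-from-motion) and compare the pose at the trajectory's start against the pose recovered at the claimed loop closure. A minimal sketch, assuming 4x4 camera-to-world pose matrices are available; the pass/fail thresholds one would apply are not specified by the paper:

```python
import numpy as np

def loop_closure_error(pose_start, pose_end):
    """Translation (meters) and rotation (degrees) drift between two
    4x4 camera-to-world poses; zero drift means a perfect loop closure."""
    rel = np.linalg.inv(pose_start) @ pose_end
    t_err = float(np.linalg.norm(rel[:3, 3]))
    # Rotation angle recovered from the trace of the relative rotation matrix.
    cos_a = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    r_err = float(np.degrees(np.arccos(cos_a)))
    return t_err, r_err

# A loop that returns exactly to its start has zero drift; a pose offset
# by half a meter does not.
drifted = np.eye(4)
drifted[:3, 3] = [0.5, 0.0, 0.2]
t, r = loop_closure_error(np.eye(4), drifted)
```

A generated sequence "failing" this test would mean large `t_err`/`r_err` at loop closure despite geo-conditioning, which is exactly the falsifier the section above describes.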

Figures

Figures reproduced from arXiv: 2604.19741 by Bharath Hariharan, Boyang Deng, Charles Herrmann, Gene Chou, Jason Y. Zhang, Kyle Genova, Noah Snavely, Philipp Henzler, Songyou Peng.

Figure 1: CityRAG generates minutes-long, spatially grounded video sequences that 1) render real buildings, traffic lights, and roads of a city; 2) follow a user-defined path and perform loop closure after generating a thousand frames; 3) are initialized from a first image and respect its weather conditions and dynamic objects. Top: Panthéon and surrounding buildings, Paris. Middle: Calle Quiñones St, San Juan. Bottom: …
Figure 2: Training data pipeline. We use Street View data in the form of panoramas. We create a training pair when two sets of captures at different times (e.g., morning vs. afternoon) lie along a continuous path with an average distance < 5 meters, so the model learns to disentangle static and transient attributes, e.g., roads and buildings (green box) vs. weather and cars (red box).
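The pairing criterion in Figure 2's caption can be sketched as a filter over capture positions. A minimal sketch, assuming planar positions in meters and a nearest-neighbor reading of "average distance" (only the < 5 m threshold comes from the caption; the exact matching rule is an assumption):

```python
import numpy as np

def is_training_pair(positions_a, positions_b, max_avg_dist=5.0):
    """Accept two capture sequences as a training pair when the average
    nearest-neighbor distance between their positions is below the
    threshold. The 5 m threshold is from the paper's caption; the
    nearest-neighbor matching rule here is an assumption."""
    a = np.asarray(positions_a, dtype=float)  # (N, 2) positions in meters
    b = np.asarray(positions_b, dtype=float)  # (M, 2)
    # Pairwise distances, then each capture in A matched to its closest in B.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean()) < max_avg_dist

# Same street captured twice, ~2 m apart: accepted as a pair.
morning = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
afternoon = [(0.0, 2.0), (10.0, 2.0), (21.0, 2.0)]
```

Pairs that pass this filter show the same static geometry under different transient conditions, which is what lets the model separate the two.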
Figure 3: Architecture of video generation in CityRAG. The generator takes three conditions: the first image, a trajectory, and geo-registered data in the form of video frames (denoted geospatial conditioning) along the trajectory. We finetune from a state-of-the-art image-to-video (I2V) generative model, Wan 2.1 (14B). It consists of a spatio-temporal VAE and a DiT-based diffusion model. The condition inputs to th…
Figure 4: RAG pipeline at inference time. The user first selects a location and image that they want to step into. Then, with a user-specified trajectory, we use the Street View database to retrieve our geospatial conditioning. All conditions are passed to the video model, which generates the output the user sees. We then automatically update the first frame and location and repeat the process. Same location, with a 90…
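The autoregressive loop described in Figure 4 can be sketched as follows. Here `database.retrieve` and `generate_clip` are hypothetical stand-ins for the Street View retrieval step and the video model, and the segment length is illustrative:

```python
def step_into_city(first_image, location, trajectory, database, generate_clip,
                   segment_len=81):
    """Autoregressive RAG loop sketched from Figure 4: for each trajectory
    segment, retrieve geospatial conditioning, generate a clip, then reuse
    the clip's last frame and position as the next iteration's state.
    All callables are hypothetical stand-ins for the paper's components."""
    frames = []
    image = first_image
    for start in range(0, len(trajectory), segment_len):
        segment = trajectory[start:start + segment_len]
        # RAG step: fetch Street View frames along this segment.
        conditioning = database.retrieve(location, segment)
        # Video model conditioned on first image, trajectory, and retrieval.
        clip = generate_clip(image, segment, conditioning)
        frames.extend(clip)
        # Update the first frame and location for the next iteration.
        image = clip[-1]
        location = segment[-1]
    return frames
```

The design point carried by the caption is that only the last generated frame and the current location propagate between iterations; spatial grounding for each new segment comes from retrieval, not from the growing frame history.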
Figure 5: Making geospatial retrieval work for arbitrary trajectories.
Figure 6: Qualitative comparisons. We show three challenging test samples. Input conditions include the video for geospatial conditioning (leftmost column), the first image of the ground truth video (rightmost column), and the trajectory defined by the ground truth video. Scene A: Our method consistently follows the weather and the black car in front. Scene B: Our method reconstructs buildings (t=7s) that appear lat…
Figure 7: User study results. Users were asked to rate each video on a scale of 1 (lowest) to 3 (highest). Larger size indicates higher visual quality. Only our method generates a video that is both a smooth continuation from the first image and a faithful render of the real physical location. The questions ask users to evaluate the generated videos' visual quality, whether they are smooth continuations of the f…
Figure 8: Flexibility of trajectory conditioning. Our trajectory conditioning does not have to be precisely aligned with the geospatial conditioning. Left: Even though there is a translation mismatch between the geospatial conditioning (car stuck in trajectory) and the trajectory's left turn, our model correctly generates a plausible sequence following the trajectory despite the mismatch. Right: Our model is capable of rotating 360 d…
Figure 9: User study Q1. 3: Visually similar, and most people would agree that they belong to the same location; there can be noticeable distortions or artifacts. 2: There are some similarities, but might not be the same location; may contain distortions or artifacts. 1: Distinctly different; likely two completely different locations.
Figure 10: User study Q2.
Figure 11: User study Q3.
Original abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CityRAG, a video generative model that uses large corpora of geo-registered but temporally unaligned data as context to produce 3D-consistent, navigable video sequences simulating real-world locations. It claims to achieve semantic disentanglement of scene structure from transient attributes, enabling minutes-long coherent generation, maintenance of weather/lighting over thousands of frames, loop closure, and navigation along complex trajectories to reconstruct real geography, with applications to autonomous driving and robotics simulation.

Significance. If the central claims hold with rigorous validation, CityRAG would represent a meaningful advance in grounded video generation by demonstrating that unaligned geo-registered corpora can support long-horizon physical consistency without explicit 3D supervision or temporal alignment. This could impact simulation pipelines where real-world spatial grounding is required.

major comments (2)
  1. [Abstract] The central experimental claims (coherent minutes-long sequences, loop closure, real-world geography reconstruction, and maintenance of conditions over thousands of frames) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes it impossible to assess whether the data support the claims of physical grounding and disentanglement.
  2. [Abstract] The weakest assumption—that temporally unaligned geo-registered data alone suffices for semantic disentanglement while preserving 3D consistency over long horizons—is presented without a concrete test or counter-example analysis. If this does not hold, the loop-closure and trajectory-navigation results would not follow.
minor comments (1)
  1. [Abstract] The abstract uses terms such as 'physically grounded' and 'real-world geography' without defining the precise criteria or evaluation protocol used to verify them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for major revision. The comments highlight important aspects of validation for our claims regarding long-horizon consistency and disentanglement. We address each point below and commit to revisions that strengthen the manuscript's rigor without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract] The central experimental claims (coherent minutes-long sequences, loop closure, real-world geography reconstruction, and maintenance of conditions over thousands of frames) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes it impossible to assess whether the data support the claims of physical grounding and disentanglement.

    Authors: We agree that the abstract presents these outcomes at a summary level without embedded numerical results or direct comparisons. The full manuscript supports the claims through extensive qualitative demonstrations of minute-long coherent sequences, loop closure under navigation, and preservation of scene structure across varying conditions. To directly address the concern, we will revise the abstract to include a concise summary of supporting evidence (e.g., frame counts and observed consistency indicators) and expand the experiments section with quantitative metrics where feasible for generative video tasks, baseline comparisons against ungrounded models, and ablation studies on the role of geo-registration. This will enable clearer assessment of the physical grounding and disentanglement. revision: yes

  2. Referee: [Abstract] The weakest assumption—that temporally unaligned geo-registered data alone suffices for semantic disentanglement while preserving 3D consistency over long horizons—is presented without a concrete test or counter-example analysis. If this does not hold, the loop-closure and trajectory-navigation results would not follow.

    Authors: We recognize that the abstract does not isolate an explicit test or counter-example for the disentanglement assumption. The manuscript's design relies on training with temporally unaligned geo-registered corpora to promote separation of fixed scene elements from transients, which is reflected in the generated outputs maintaining geometry while varying weather and dynamics. To strengthen the presentation, we will add a dedicated analysis subsection detailing how the unaligned training enables this (with examples of consistent reconstruction under changed conditions) and include discussion of edge cases or potential limitations where disentanglement may be stressed, such as in highly cluttered urban scenes. This will clarify the link to the reported loop-closure and navigation results. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claims rest on experimental demonstrations of video generation using geo-registered but temporally unaligned corpora to achieve semantic disentanglement and 3D-consistent navigation. No load-bearing derivation, equation, or prediction is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The approach is described as leveraging external data properties for grounding, with results validated through generation tasks rather than tautological renaming or imported uniqueness theorems. This is a self-contained empirical modeling paper without detectable circular steps in the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate any free parameters, axioms, or invented entities; the central claim rests on the existence and utility of geo-registered corpora and the effectiveness of temporally unaligned training for disentanglement, neither of which is formalized here.

pith-pipeline@v0.9.0 · 5501 in / 1193 out tokens · 44808 ms · 2026-05-10T02:05:31.073207+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 35 canonical work pages · 17 internal anchors

  1. [1]

    In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)

    Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building rome in a day. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 72–79 (2009).https://doi.org/10.1109/ICCV.2009.5459148

  2. [2]

    In: CVPR (2025)

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In: CVPR (2025)

  3. [3]

    Lumiere: A space-time diffusion model for video generation

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Li, Y., Michaeli, T., Wang, O., Sun, D., Dekel, T., Mosseri, I.: Lumiere: A space-time diffusion model for video generation. In: Proceedings of the 41st International Conference on Machine Learning (2024),https://arxiv.org/abs/ 2401.12945

  4. [4]

    arXiv preprint arXiv:2312.03079 (2023).https://doi.org/ 10.48550/arxiv.2312.03079

    Bhat, S.F., Mitra, N.J., Wonka, P.: Loosecontrol: Lifting controlnet for generalized depth conditioning. arXiv preprint arXiv:2312.03079 (2023).https://doi.org/ 10.48550/arxiv.2312.03079

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  6. [6]

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators

  7. [7]

    In: ICLR (2026)

    Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., Agrawala, M., Jiang, L., Wetzstein, G.: Mixture of contexts for long video generation. In: ICLR (2026)

  8. [8]

    Advances in Neural Information Processing Systems37, 24081–24125 (2025) 16 G

    Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2025) 16 G. Chou et al

  9. [9]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Chen, T.S., Siarohin, A., Menapace, W., Fang, Y., Lee, K.S., Skorokhodov, I., Aberman, K., Zhu, J.Y., Yang, M.H., Tulyakov, S.: Video alchemist: Multi-subject open-set personalization in video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  10. [10]

    arXiv preprint arXiv:2602.06159 (2026)

    Chen, X., Zhang, C., Fu, C., Yang, Z., Zhou, K., Zhang, Y., He, J., Zhang, Y., Sun, M., Wang, Z., Dong, Z., Long, X., Meng, L.: Driving with dino: Vision foundation features as a unified bridge for sim-to-real generation in autonomous driving. arXiv preprint arXiv:2602.06159 (2026)

  11. [11]

    In: CVPR (2022)

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)

  12. [12]

    Computer Vision and Image Un- derstanding259, 104445 (2025).https://doi.org/10.1016/j.cviu.2025.104445

    Chigot, E., Wilson, D.G., Ghrib, M., Oberlin, T.: Style transfer with diffusion models for synthetic-to-real domain adaptation. Computer Vision and Image Un- derstanding259, 104445 (2025).https://doi.org/10.1016/j.cviu.2025.104445

  13. [13]

    KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos

    Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3d-consistent videos from unposed internet photos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025),https://arxiv.org/abs/2411.13549

  14. [14]

    googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

    DeepMind, G.: Veo: a text-to-video generation system (2025),https://storage. googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

  15. [15]

    In: ACM SIGGRAPH 2024 Conference Papers

    Deng, B., Tucker, R., Li, Z., Guibas, L., Snavely, N., Wetzstein, G.: Streetscapes: Large-scale consistent street view generation using autoregressive video diffusion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  16. [16]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)

    Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)

  17. [17]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)

  18. [18]

    arXiv preprint arXiv:2601.05239 (2025)

    Fu, X., Tang, S., Shi, M., Liu, X., Gu, J., Liu, M.Y., Lin, D., Lin, C.H.: Plenoptic video generation. arXiv preprint arXiv:2601.05239 (2025)

  19. [19]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.C., Dong, Y., Mo, K., Lin, C.H., Ma, Q., Nah, S., Magne, L., Xiang, J., Xie, Y., Zheng, R., Niu, D., Tan, Y.L., Zentner, K., Kurian, G., Indupuru, S., Jannaty, P., Gu, J., Zhang, J., Malik, J., Abbeel, P., Liu, M.Y., Zhu, Y., Jang, J., Fan, L.J.: Dreamdojo: A generalist robot world model f...

  20. [20]

    In: International Conference on Learning Rep- resentations (2024)

    Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion fea- tures for consistent video editing. In: International Conference on Learning Rep- resentations (2024)

  21. [21]

    Emu video: Factorizing text-to-video gen- eration by explicit image conditioning,

    Girdhar, R., Singh, M., Brown, A., et al.: Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023), https://arxiv.org/abs/2311.10709

  22. [22]

    arXiv preprint arXiv:2501.03847 (2025)

    Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., Wang, W., Liu, Y.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. arXiv preprint arXiv:2501.03847 (2025)

  23. [23]

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning (2023) CityRAG 17

  24. [24]

    Advances in neural information processing systems30(2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  25. [25]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022),https://arxiv. org/abs/2210.02303

  26. [26]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  27. [27]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  28. [28]

    Jin, Y., Sun, Z., Li, N., Xu, K., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., Lin, Z.: Pyramidal flow matching for efficient video generative modeling (2024)

  29. [29]

    ACM Transactions on Graphics42(4) (2023)

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics42(4) (2023)

  30. [30]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Interna- tional Conference on Learning Representations (ICLR) (2015),https://arxiv. org/abs/1412.6980

  31. [31]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  32. [32]

    Anyv2v: A tuning-free framework for any video-to- video editing tasks.arXiv preprint arXiv:2403.14468, 2024

    Ku, M., Wei, C., Ren, W., Yang, H., Chen, W., Liang, Y., Zheng, T., Guo, M., Zhao, X., Sang, J., Yang, M.H., Chen, W.: Anyv2v: A plug-and-play framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

  33. [33]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    gi Kwak, J., Dong, E., Jin, Y., Ko, H., Mahajan, S., Yi, K.M.: Vivid-1-to-3: Novel view synthesis with video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6775–6785 (2024)

  34. [34]

    Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

  35. [35]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33, pp. 9459–9474 (2020)

  36. [36]

    arXiv preprint arXiv:2405.15757 (2024)

    Liang,F.,Kodaira,A.,Xu,C.,Tomizuka,M.,Keutzer,K.,Marculescu,D.:Looking backward: Streaming video-to-video translation with feature banks. arXiv preprint arXiv:2405.15757 (2024)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

    Liang, F., Wu, B., Wang, J., Yu, L., Li, K., Zhao, Y., Misra, I., Huang, J.B., Zhang, P., Vajda, P., Marculescu, D.: Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  38. [38]

    Muon is Scalable for LLM Training

    Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., Chen, Y., Zheng, H., Liu, Y., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y., Wang, J., Dong, M., Zhang, Z., Kang, Y., Zhang, H., Xu, X., Zhang, Y., Wu, Y., Zhou, X., Yang, Z.: Muon is scalable for llm training (2025),https: //arxiv.org/abs/2502.16982 18 G. Chou et al

  39. [39]

    R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

    Ljungbergh, W., Taveira, B., Zheng, W., Tonderski, A., Peng, C., Kahl, F., Pe- tersson, C., Felsberg, M., Keutzer, K., Tomizuka, M., Zhan, W.: R3d2: Realistic 3d asset insertion via diffusion for autonomous driving simulation. arXiv (2025). https://doi.org/10.48550/arxiv.2506.07826

  40. [40]

    LUMA: Luma dream machine (2024),https://lumalabs.ai/dream-machine

  41. [41]

    Realrag: Retrieval-augmented realistic image generation via self-reflective contrastive learning.arXiv preprint arXiv:2502.00848, 2025

    Lyu, Y., Zheng, X., Jiang, L., Yan, Y., Zou, X., Zhou, H., Zhang, L., Hu, X.: Re- alrag: Retrieval-augmented realistic image generation via self-reflective contrastive learning. In: Proceedings of the 42nd International Conference on Machine Learn- ing (ICML) (2025),https://arxiv.org/abs/2502.00848

  42. [42]

    Movie Gen: A Cast of Media Foundation Models

    team at Meta, T.M.G.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024),https://arxiv.org/abs/2410.13720

  43. [43]

    ICCV (2021)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. ICCV (2021)

  44. [44]

    com/krea-ai/realtime-video

    Millon,E.:Krearealtime14b:Real-timevideogeneration(2025),https://github. com/krea-ai/realtime-video

  45. [45]

    Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195– 4205 (2023)

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Ren,X.,Shen,T.,Huang,J.,Ling,H.,Lu,Y.,Nimier-David,M.,Müller,T.,Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  47. [47]

    In: International Conference on Learning Representations (2022),https: //openreview.net/forum?id=mFppY38Z36C

    Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion mod- els. In: International Conference on Learning Representations (2022),https: //openreview.net/forum?id=mFppY38Z36C

  48. [48]

    In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

    Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  49. [49]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Seed, B.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025),https://arxiv.org/abs/2506.09113

  50. [50]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Singer, U., Polyak, A., Hayes, T., Shaham, D., Saharia, C., Chan, W., Norouzi, M.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022),https://arxiv.org/abs/2209.14792

  51. [51]

    Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (2006),https: //api.semanticscholar.org/CorpusID:13385757

    Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3d. Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (2006),https: //api.semanticscholar.org/CorpusID:13385757

  52. [52]

    Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History- guided video diffusion (2025),https://arxiv.org/abs/2502.06764

  53. [53]

    In: CVPR (2022)

    Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Bar- ron, J.T., Kretzschmar, H.: Block-nerf: Scalable large scene neural view synthesis. In: CVPR (2022)

  54. [54]

    In: ECCV (2024)

    Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G., Hariharan, B., Snavely, N.: Megascenes: Scene-level view synthesis at scale. In: ECCV (2024)

  55. [55]

    In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)

    Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 12922–12931 (2022)

  56. [56]

    In: European Conference on Computer Vision (ECCV) (2024) CityRAG 19

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision (ECCV) (2024) CityRAG 19

  57. [57]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  58. [58]

    arXiv preprint arXiv:2401.09962 (2024),https://arxiv.org/abs/2401.09962

    Wang, Z., Li, A., Zhu, L., Guo, Y., Dou, Q., Li, Z.: Customvideo: Customizing text-to-video generation with multiple subjects. arXiv preprint arXiv:2401.09962 (2024),https://arxiv.org/abs/2401.09962

  59. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

  60. Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers (2024)

  61. Waymo: The waymo world model: A new frontier for autonomous driving simulation. Waymo Blog (February 2026), https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation

  62. Wei, Y., Zhang, S., Qing, Z., Yuan, H., Liu, Z., Liu, Y., Zhang, Y., Zhou, J., Shan, H.: Dreamvideo: Composing your dream videos with customized subject and motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6537–6549 (2024)

  63. Wu, B., Chuang, C.Y., Wang, X., Jia, Y., Krishnakumar, K., Xiao, T., Liang, F., Yu, L., Vajda, P.: Fairy: Fast parallelized instruction-guided video-to-video synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  64. Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video world models with long-term spatial memory (2025), https://arxiv.org/abs/2506.05284

  65. Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: Worldmem: Long-term consistent world simulation with memory (2025), https://arxiv.org/abs/2504.12369

  66. Yang, Z., Guo, X., Ding, C., Wang, C., Wu, W., Zhang, Y.: Instadrive: Instance-aware driving world models for realistic and consistent video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 25410–25420 (2025)

  67. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  68. Yu, M., Hu, W., Xing, J., Shan, Y.: Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 100–111 (October 2025)

  69. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

  70. Zhang, L., Cai, S., Li, M., Wetzstein, G., Agrawala, M.: Frame context packing and drift prevention in next-frame-prediction video diffusion models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  71. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023)

  72. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  73. Zhao, H., Weng, H., Lu, D., Li, A., Li, J., Panda, A., Xie, S.: On scaling up 3d gaussian splatting training (2024), https://arxiv.org/abs/2406.18533

  74. Zhou, J.J., Gao, H., Voleti, V., Vasishta, A., Yao, C.H., Boss, M., Torr, P., Rupprecht, C., Jampani, V.: Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint (2025)

  75. Zhou, Y., Simon, M., Peng, Z., Mo, S., Zhu, H., Guo, M., Zhou, B.: Simgen: Simulator-conditioned driving scene generation. arXiv preprint arXiv:2406.09386 (2024)

  76. Zhu, C., Wu, Y., Wang, S., Wu, G., Wang, L.: Motionrag: Motion retrieval-augmented image-to-video generation. In: Proceedings of the 39th International Conference on Neural Information Processing Systems (2025)

A Appendix A: Details of User Study

As mentioned in the main text, we conduct a user study to evaluate the capabilities and limitations of each...