Recognition: unknown
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3
The pith
CityRAG generates coherent minutes-long videos of real cities grounded in geo-registered data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CityRAG is a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. It relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Experiments demonstrate that it can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
What carries the argument
CityRAG, a video generative model conditioned on geo-registered data to enforce spatial grounding and 3D consistency across long sequences.
Load-bearing premise
That large corpora of geo-registered but temporally unaligned data are sufficient to teach the model to semantically disentangle the underlying scene from transient attributes while still producing 3D-consistent navigation over long sequences.
What would settle it
A test video sequence following a known real-world trajectory that shows drift in 3D geometry or failure to close loops despite the geo-conditioning.
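One way to operationalize such a test, offered here only as a minimal sketch and not as the paper's protocol, is to estimate relative camera poses between consecutive generated frames with any off-the-shelf SfM or pose-estimation tool and then measure how far the composed motion around a nominally closed loop deviates from the identity. The pose-estimation step, the 4x4 homogeneous-matrix convention, and the thresholds one would apply are all assumptions of this sketch.

```python
# Minimal sketch (not from the paper): quantify loop-closure drift for a
# generated sequence whose camera trajectory is supposed to return to its start.
# Assumes relative camera poses between consecutive frames have already been
# estimated by some external tool (e.g., an SfM pipeline); that step is not shown.
import numpy as np

def loop_closure_drift(relative_poses):
    """relative_poses: list of 4x4 homogeneous matrices, frame i -> frame i+1.

    For a perfectly closed loop, composing all relative poses gives the identity.
    Returns (translation_drift, rotation_drift_degrees).
    """
    T = np.eye(4)
    for rel in relative_poses:
        T = T @ rel  # accumulate motion around the loop
    translation_drift = float(np.linalg.norm(T[:3, 3]))
    # Rotation drift: angle of the residual rotation matrix.
    cos_angle = np.clip((np.trace(T[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rotation_drift_deg = float(np.degrees(np.arccos(cos_angle)))
    return translation_drift, rotation_drift_deg

if __name__ == "__main__":
    # Toy example: four 90-degree turns around a square; drift should be ~0.
    def turn_and_step(angle_deg, step):
        a = np.radians(angle_deg)
        R = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0,        0.0,       1.0]])
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = [step, 0.0, 0.0]
        return T

    poses = [turn_and_step(90.0, 1.0) for _ in range(4)]
    print(loop_closure_drift(poses))
```

In practice the drift numbers would be compared against the same pose-estimation pipeline run on an ungrounded baseline, so that estimator error is not misread as generation drift.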
Original abstract
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
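To make the abstract's "geo-registered data as context" concrete, the following is an illustrative sketch of one plausible retrieval-as-conditioning wiring: nearest-neighbor retrieval of reference imagery by camera position, with the retrieved frames handed to the generator as context. The retrieval rule, the tensor shapes, and the `denoise_step` interface are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative sketch only: one plausible way to turn geo-registered imagery
# into conditioning context for a video generator, in the spirit of the
# abstract. Retrieval rule, shapes, and the denoise_step call are assumptions.
import numpy as np

def retrieve_geo_context(query_position, ref_positions, ref_images, k=4):
    """Nearest-neighbor retrieval of geo-registered reference frames.

    query_position: (3,) camera position for the frame being generated.
    ref_positions:  (N, 3) positions of geo-registered reference images.
    ref_images:     (N, H, W, 3) the corresponding images.
    Returns the k references closest to the query, to be used as context.
    """
    dists = np.linalg.norm(ref_positions - query_position[None, :], axis=1)
    nearest = np.argsort(dists)[:k]
    return ref_images[nearest]

# Usage idea (hypothetical denoiser interface): at each generation step the
# retrieved frames are passed alongside the noisy latents, so the model can
# copy scene structure from them while transients (weather, traffic) remain
# under the control of its learned prior.
#
#   context = retrieve_geo_context(cam_pos, ref_positions, ref_images)
#   latents = denoise_step(latents, context=context, camera=cam_pose)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_positions = rng.uniform(-100, 100, size=(50, 3))
    ref_images = rng.integers(0, 256, size=(50, 8, 8, 3))
    context = retrieve_geo_context(np.zeros(3), ref_positions, ref_images)
    print(context.shape)  # (4, 8, 8, 3)
```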
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CityRAG, a video generative model that uses large corpora of geo-registered but temporally unaligned data as context to produce 3D-consistent, navigable video sequences simulating real-world locations. It claims to achieve semantic disentanglement of scene structure from transient attributes, enabling minutes-long coherent generation, maintenance of weather/lighting over thousands of frames, loop closure, and navigation along complex trajectories to reconstruct real geography, with applications to autonomous driving and robotics simulation.
Significance. If the central claims hold with rigorous validation, CityRAG would represent a meaningful advance in grounded video generation by demonstrating that unaligned geo-registered corpora can support long-horizon physical consistency without explicit 3D supervision or temporal alignment. This could impact simulation pipelines where real-world spatial grounding is required.
Major comments (2)
- [Abstract] The central experimental claims (coherent minutes-long sequences, loop closure, real-world geography reconstruction, and maintenance of conditions over thousands of frames) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes it impossible to assess whether the data support the claims of physical grounding and disentanglement.
- [Abstract] The weakest assumption—that temporally unaligned geo-registered data alone suffices for semantic disentanglement while preserving 3D consistency over long horizons—is presented without a concrete test or counter-example analysis. If this does not hold, the loop-closure and trajectory-navigation results would not follow.
Minor comments (1)
- [Abstract] The abstract uses terms such as 'physically grounded' and 'real-world geography' without defining the precise criteria or evaluation protocol used to verify them.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the recommendation for major revision. The comments highlight important aspects of validation for our claims regarding long-horizon consistency and disentanglement. We address each point below and commit to revisions that strengthen the manuscript's rigor without altering its core contributions.
Point-by-point responses
- Referee: [Abstract] The central experimental claims (coherent minutes-long sequences, loop closure, real-world geography reconstruction, and maintenance of conditions over thousands of frames) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes it impossible to assess whether the data support the claims of physical grounding and disentanglement.
  Authors: We agree that the abstract presents these outcomes at a summary level without embedded numerical results or direct comparisons. The full manuscript supports the claims through extensive qualitative demonstrations of minutes-long coherent sequences, loop closure under navigation, and preservation of scene structure across varying conditions. To directly address the concern, we will revise the abstract to include a concise summary of supporting evidence (e.g., frame counts and observed consistency indicators) and expand the experiments section with quantitative metrics where feasible for generative video tasks, baseline comparisons against ungrounded models, and ablation studies on the role of geo-registration; a sketch of one such consistency indicator follows these responses. This will enable clearer assessment of the physical grounding and disentanglement claims. Revision: yes.
- Referee: [Abstract] The weakest assumption—that temporally unaligned geo-registered data alone suffices for semantic disentanglement while preserving 3D consistency over long horizons—is presented without a concrete test or counter-example analysis. If this does not hold, the loop-closure and trajectory-navigation results would not follow.
  Authors: We recognize that the abstract does not isolate an explicit test or counter-example for the disentanglement assumption. The manuscript's design relies on training with temporally unaligned geo-registered corpora to promote separation of fixed scene elements from transients, which is reflected in the generated outputs maintaining geometry while varying weather and dynamics. To strengthen the presentation, we will add a dedicated analysis subsection detailing how the unaligned training enables this (with examples of consistent reconstruction under changed conditions) and include discussion of edge cases or potential limitations where disentanglement may be stressed, such as in highly cluttered urban scenes. This will clarify the link to the reported loop-closure and navigation results. Revision: yes.
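The "observed consistency indicators" mentioned in the first response could take many forms; the following is a minimal sketch of one assumption-level possibility, a revisit-consistency score comparing frames that should depict the same place (for instance, the first frame of a loop and the frame generated after loop closure). PSNR is used only because it is simple and dependency-free; it presumes the two frames share a viewpoint, and a perceptual metric or pose alignment would be needed otherwise. None of this is the authors' stated evaluation protocol.

```python
# Minimal sketch (an assumption, not the authors' evaluation protocol): a
# simple revisit-consistency indicator. Frames that should show the same place
# are compared with PSNR; an ungrounded baseline that drifts should score lower.
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio between two images of equal shape."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def revisit_consistency(frame_pairs):
    """frame_pairs: list of (frame_at_first_visit, frame_at_revisit) arrays.

    Returns the mean PSNR over revisited locations; higher suggests the model
    re-renders the same geometry rather than hallucinating a new scene.
    Assumes each pair shares the same viewpoint.
    """
    return float(np.mean([psnr(a, b) for a, b in frame_pairs]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    first = rng.integers(0, 256, size=(64, 64, 3))
    revisit = np.clip(first + rng.normal(0, 5, size=first.shape), 0, 255)
    print(revisit_consistency([(first, revisit)]))
```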
Circularity Check
No significant circularity identified
Full rationale
The paper's central claims rest on experimental demonstrations of video generation using geo-registered but temporally unaligned corpora to achieve semantic disentanglement and 3D-consistent navigation. No load-bearing derivation, equation, or prediction is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The approach is described as leveraging external data properties for grounding, with results validated through generation tasks rather than tautological renaming or imported uniqueness theorems. This is a self-contained empirical modeling paper without detectable circular steps in the provided derivation outline.