GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
Pith reviewed 2026-05-14 19:35 UTC · model grok-4.3 · Recognition: 2 theorem links
The pith
GTA generates 3D worlds from a single image by first creating coarse geometry and then synthesizing appearance, using two separate video diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a single input image, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, it introduces a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance.
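The two-stage data flow described above can be sketched minimally. The stage functions below are stand-ins I have invented for illustration, not the paper's actual models or API: random depths replace the geometry diffusion model and a broadcast of the input colors replaces the appearance model, so only the geometry-then-appearance ordering and conditioning path are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometry_stage(image, n_views):
    """Stand-in for the first video diffusion model: one coarse depth
    map per novel view. The real model is replaced by random depths;
    only the data flow is illustrated."""
    h, w = image.shape[:2]
    return rng.uniform(0.5, 10.0, size=(n_views, h, w))

def appearance_stage(image, depths):
    """Stand-in for the second model: one RGB frame per view,
    conditioned on the predicted geometry. Here the conditioning is
    only a shape check; a real model would feed depths to the denoiser."""
    n_views, h, w = depths.shape
    assert image.shape[:2] == (h, w)
    return np.repeat(image[None], n_views, axis=0)

image = rng.uniform(0.0, 1.0, size=(64, 64, 3))
depths = geometry_stage(image, n_views=8)   # stage 1: geometry first
frames = appearance_stage(image, depths)    # stage 2: appearance, conditioned on it
```

The point of the sketch is the dependency direction: appearance never runs without a geometry estimate, which is exactly the coupling the referee report below probes.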
What carries the argument
Two sequential video diffusion models that first predict coarse geometry from novel views and then generate appearance conditioned on the predicted geometry.
If this is right
- Synthesized 3D scenes exhibit higher structural fidelity and cross-view consistency.
- The method outperforms prior image-to-3D approaches on fidelity, visual quality, and geometric accuracy metrics.
- GTA functions as a plug-in enhancement that raises output quality of existing image-to-3D pipelines.
- It supports downstream tasks in spatial intelligence, embodied intelligence, and autonomous driving.
- Training shows favorable data efficiency compared with single-stage alternatives.
Where Pith is reading between the lines
- Future systems could treat geometry and appearance modules as independently upgradable components.
- The same staged pipeline could be tested on text or video inputs for broader 3D generation.
- Explicit geometry output may integrate more cleanly with physics simulators or robotics planning.
- Data-efficiency claims invite direct measurement of training curves against single-stage baselines on fixed compute budgets.
Load-bearing premise
Separating geometry prediction from appearance synthesis in a two-stage video diffusion pipeline will reliably raise structural fidelity and view consistency without creating new inconsistencies.
What would settle it
A controlled experiment in which the two-stage GTA model produces equal or lower geometric accuracy scores and more cross-view inconsistencies than a single unified diffusion baseline on the same benchmark dataset would falsify the central claim.
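One concrete form such a controlled comparison could take: score both models' depth predictions against the same ground truth with the same metric, and treat the premise as falsified if the two-stage model is no better. The metric below is standard absolute relative depth error; the "predictions" are synthetic placeholders with hand-chosen noise levels, used only to make the decision rule explicit.

```python
import numpy as np

def abs_rel_error(pred_depth, gt_depth, eps=1e-6):
    """Absolute relative depth error, averaged over all pixels."""
    return float(np.mean(np.abs(pred_depth - gt_depth) / (gt_depth + eps)))

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(4, 32, 32))                    # ground-truth depths
two_stage = gt * (1 + 0.05 * rng.standard_normal(gt.shape))      # placeholder: 5% noise
single_stage = gt * (1 + 0.15 * rng.standard_normal(gt.shape))   # placeholder: 15% noise

err_two = abs_rel_error(two_stage, gt)
err_one = abs_rel_error(single_stage, gt)
# The central claim survives this (synthetic) test only if err_two < err_one.
claim_falsified = err_two >= err_one
```

Cross-view inconsistency counts would need a second metric, but the falsification logic is the same: fix the benchmark, vary only the architecture, compare.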
Original abstract
Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GTA, a two-stage image-to-3D world generation framework using dedicated video diffusion models: the first generates coarse geometric structure from novel viewpoints given a single input image, and the second synthesizes fine-grained appearance conditioned on the predicted geometry. It introduces a random latent shuffle strategy during training for cross-view consistency and a test-time scaling scheme, claiming consistent outperformance over prior methods in fidelity, visual quality, and geometric accuracy, plus utility as a general enhancement module for existing pipelines and support for downstream tasks.
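The "random latent shuffle" is only named, not specified, in the provided text. One plausible reading, assumed here purely for illustration, is a training-time augmentation that randomly permutes the view/frame axis of the latent video tensor with some probability, so the appearance model cannot rely on a fixed frame order for consistency.

```python
import numpy as np

def random_latent_shuffle(latents, p=0.5, rng=None):
    """Hypothetical training-time augmentation: with probability p,
    permute the view/frame axis of a latent video tensor.

    latents: array of shape (n_frames, channels, height, width).
    """
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        perm = rng.permutation(latents.shape[0])
        return latents[perm]
    return latents

rng = np.random.default_rng(1)
z = rng.standard_normal((6, 4, 8, 8))            # toy latent video
z_shuffled = random_latent_shuffle(z, p=1.0, rng=rng)  # force a shuffle
```

Whether the paper shuffles whole frames, latent channels, or something finer-grained cannot be determined from the abstract; the sketch only fixes one interpretation.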
Significance. If the two-stage separation reliably improves structural fidelity without error propagation, the work could advance 3D generation by better aligning with human visual perception principles, offering a versatile plug-in that enhances multiple image-to-3D pipelines while maintaining data efficiency.
Major comments (2)
- [Experiments] Experiments section: the central claim of improved geometric accuracy and structural fidelity rests on final rendered metrics, but no independent evaluation of the geometry stage (e.g., per-view depth error, multi-view consistency scores, or 3D point cloud alignment) is reported, leaving the error-propagation risk from the first diffusion model untested.
- [Method] Method section (two-stage framework description): without ablations that disable the geometry stage while holding other components fixed, it remains unclear whether the coarse-to-fine separation mitigates inconsistencies or whether conditioning the appearance model on imperfect geometry predictions introduces new view-dependent artifacts.
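The standalone geometry evaluation the first comment asks for could include a point-cloud alignment score. A minimal brute-force symmetric Chamfer distance (average nearest-neighbor distance in both directions) is sketched below on toy point clouds; this is a generic metric, not one the paper reports.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (n, 3) and b (m, 3).
    Brute force: fine for toy sizes, O(n*m) memory."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())

rng = np.random.default_rng(0)
gt_cloud = rng.uniform(-1.0, 1.0, size=(200, 3))                  # toy ground truth
pred_close = gt_cloud + 0.01 * rng.standard_normal((200, 3))      # small geometric error
pred_far = gt_cloud + 0.5 * rng.standard_normal((200, 3))         # large geometric error

cd_close = chamfer_distance(gt_cloud, pred_close)  # should be near zero
cd_far = chamfer_distance(gt_cloud, pred_far)      # substantially larger
```

Reporting such a score directly on the first-stage outputs, before the appearance model touches them, is what would separate geometry-stage error from rendering error.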
Minor comments (2)
- [Abstract] Abstract: the assertion of consistent outperformance lacks any mention of specific metrics, baselines, or quantitative gains, which should be summarized briefly for immediate clarity.
- [Figures] Figure captions and notation: the distinction between the two video diffusion models could be labeled more explicitly in diagrams to improve readability of the pipeline.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Experiments] Experiments section: the central claim of improved geometric accuracy and structural fidelity rests on final rendered metrics, but no independent evaluation of the geometry stage (e.g., per-view depth error, multi-view consistency scores, or 3D point cloud alignment) is reported, leaving the error-propagation risk from the first diffusion model untested.
Authors: We appreciate the referee highlighting this point. While our reported metrics on final renderings (including depth-related and consistency measures) already reflect the downstream impact of the geometry stage, we agree that standalone evaluation of the geometry predictions would more directly address error-propagation concerns. In the revised manuscript we will add per-view depth error, multi-view consistency scores, and 3D point-cloud alignment metrics computed on the outputs of the first-stage geometry model.
Revision: yes
Referee: [Method] Method section (two-stage framework description): without ablations that disable the geometry stage while holding other components fixed, it remains unclear whether the coarse-to-fine separation mitigates inconsistencies or whether conditioning the appearance model on imperfect geometry predictions introduces new view-dependent artifacts.
Authors: We acknowledge that the current ablation suite does not include a controlled comparison that isolates the geometry stage. To clarify the contribution of the Geometry-Then-Appearance separation, we will add new experiments in the revision that (i) disable the geometry stage and generate appearance directly from the input image and (ii) condition the appearance model on ground-truth geometry when available. These ablations will quantify both the benefit of the two-stage design and any view-dependent artifacts arising from imperfect geometry predictions.
Revision: yes
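The rebuttal's two proposed ablations, plus the full model, differ only in what conditions the appearance stage. That can be pinned down as a small configuration grid; all names here are hypothetical labels for the experimental design, not the paper's code.

```python
# Hypothetical ablation grid isolating the geometry stage. Only the
# conditioning source varies; training data, schedule, and the
# appearance model are held fixed across runs.
ABLATIONS = {
    "full":        {"geometry_stage": True,  "condition": "predicted_geometry"},
    "no_geometry": {"geometry_stage": False, "condition": "input_image_only"},
    "oracle":      {"geometry_stage": False, "condition": "ground_truth_geometry"},
}

def run_ablation(name):
    """In a real study this would train and evaluate the appearance model
    under the named configuration; here it just returns the config."""
    return ABLATIONS[name]
```

Comparing "full" against "no_geometry" measures the benefit of the two-stage design; comparing "full" against "oracle" bounds the cost of imperfect first-stage predictions.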
Circularity Check
No significant circularity in the proposed Geometry-Then-Appearance framework
Full rationale
The paper proposes an independent two-stage architectural design consisting of separate video diffusion models for coarse geometry generation followed by appearance synthesis conditioned on the geometry output. This choice is explicitly motivated by the coarse-to-fine structure of human visual perception rather than derived from any equations, fitted parameters, or self-citations that reduce the claims to inputs by construction. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the provided text. The central claims rest on empirical outperformance and versatility as a plug-in module, with no self-referential reductions in the derivation chain. This matches the expected honest non-finding for a standard method proposal.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Human visual perception processes scenes in a coarse-to-fine manner.
- Domain assumption: Video diffusion models can produce consistent novel-view geometry and appearance.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: echoes
  This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "motivated by the coarse-to-fine nature of human visual perception"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.