pith. machine review for the scientific record.

arxiv: 2604.13036 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Lyra 2.0: Explorable Generative 3D Worlds

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D world generation · video generation · generative reconstruction · consistent video synthesis · 3D scene reconstruction · long trajectory generation · spatial consistency · temporal consistency

The pith

Maintaining per-frame 3D geometry solely for routing, combined with self-augmented training, yields longer 3D-consistent video trajectories and high-quality reconstructed scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that video generation models can produce persistent explorable 3D worlds by overcoming spatial forgetting and temporal drifting over long camera trajectories. It keeps per-frame 3D geometry only to retrieve relevant past frames and match viewpoints for the generative model, which then synthesizes appearances. It also trains the model on histories that include its own prior outputs so the model learns to correct rather than accumulate errors. If this holds, creators could scale video models to generate large, revisitable 3D environments and convert them into renderable scenes for simulation and games.

Core claim

Lyra 2.0 shows that addressing spatial forgetting with per-frame 3D geometry used solely for routing past frames and establishing dense correspondences, combined with self-augmented histories to teach drift correction, produces substantially longer and 3D-consistent video trajectories. These trajectories then allow fine-tuning of feed-forward reconstruction models to recover high-quality 3D scenes ready for real-time rendering and simulation.

What carries the argument

Per-frame 3D geometry maintained exclusively for information routing to retrieve past frames and build dense correspondences, paired with self-augmented history training for learning to correct accumulated synthesis errors.
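The retrieval rule is not spelled out at this level of detail, but the idea of geometry-only routing can be illustrated with a short sketch: score each history frame by how much of its cached point cloud falls inside the target view's frustum, and keep the top-scoring frames as context for the generator. The function names, the frustum-coverage score, and the top-k cutoff below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of geometry-based frame retrieval (hypothetical, not the
# authors' exact procedure): each past frame carries a cached point cloud;
# history frames are ranked by how much of that geometry the target view sees.
import numpy as np

def project_visible(points_world, K, target_w2c, width, height):
    """Boolean mask: which Nx3 world points land inside the target image."""
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (target_w2c @ pts_h.T).T[:, :3]        # points in target-camera coordinates
    in_front = cam[:, 2] > 1e-6
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    return (in_front & (uv[:, 0] >= 0) & (uv[:, 0] < width)
                     & (uv[:, 1] >= 0) & (uv[:, 1] < height))

def retrieve_history_frames(history, target_w2c, K, width, height, k=4):
    """history: list of dicts, each with a cached 'points' (Nx3) array per past frame."""
    coverage = [project_visible(f["points"], K, target_w2c, width, height).mean()
                for f in history]                 # fraction of each frame's geometry visible to the target view
    return np.argsort(coverage)[::-1][:k]         # indices of the k most relevant past frames
```

The retrieved frames would then be handed to the video model as conditioning, with dense correspondences (not shown here) telling it where retrieved content should reappear in the target view.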

Load-bearing premise

That the 3D geometry can be accurately maintained and used for routing without correspondence errors, and that the generative model will synthesize correct appearances from the routed information without new hallucinations on revisits.

What would settle it

Generate a long camera trajectory that revisits an initial viewpoint after many steps and check whether the 3D reconstruction from the video matches the original scene geometry and appearance without structural changes or drift; mismatches would show the method does not prevent forgetting or correct drift.
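A minimal version of that revisit check, with hypothetical metrics standing in for whatever protocol the paper actually uses: PSNR between the original frame and the frame generated on return, and a Chamfer distance between the point clouds reconstructed at the two moments.

```python
# Sketch of a revisit-consistency test (illustrative metrics, not the paper's
# evaluation protocol). Low PSNR or high Chamfer distance on a revisit would
# signal spatial forgetting or accumulated drift.
import numpy as np

def psnr(a, b):
    """Peak signal-to-noise ratio between two images with values in [0, 1]."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)

def chamfer(p, q):
    """Symmetric Chamfer distance between two point clouds (Nx3, Mx3); brute force."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def revisit_drift(first_frame, revisit_frame, first_points, revisit_points):
    """Compare the start of the trajectory against the state after the loop closes."""
    return {"psnr": psnr(first_frame, revisit_frame),
            "chamfer": chamfer(first_points, revisit_points)}
```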

Figures

Figures reproduced from arXiv: 2604.13036 by Huan Ling, Jiahui Huang, Jiawei Ren, Jun Gao, Kai He, Nicholas Sharp, Ruilong Li, Sangeetha Grama Srinivasan, Sanja Fidler, Sherwin Bahmani, Tianchang Shen, Tianshi Cao, Xuanchi Ren, Zan Gojcic, Zian Wang.

Figure 1: Lyra 2.0 enables long-horizon 3D-consistent scene generation from a single image. Starting from an input image, users iteratively define camera motion to explore the scene, while Lyra 2.0 synthesizes spatially persistent video outputs that progressively expand the environment. These videos can be directly reconstructed into high-fidelity 3D Gaussians and surface meshes, yielding 3D assets deployable in sim…
Figure 2: Method overview. (Left) Given an input image, Lyra 2.0 iteratively generates video segments guided by a user-defined camera trajectory from an interactive 3D explorer and an optional text prompt, lifting each segment into 3D point clouds fed back for continued navigation. Generated video frames are finally reconstructed and exported as 3D Gaussians or meshes. (Right) At each step, history frames with maxim…
Figure 3: Video generation comparisons. Given a single input image from Tanks and Temples, we compare long-horizon generations (∼frame 800+) from all evaluated video models. Baselines exhibit severe quality degradation, geometric distortions, or content drifting at long horizons, while our method maintains realistic structures and appearances.
Figure 4: 3DGS comparisons. We compare renderings from 3DGS scenes reconstructed from video diffusion model outputs, starting from a single input image from Tanks and Temples. While all baselines produce scenes with artifacts and floaters, our pipeline is able to generate realistic 3D scenes with high fidelity. We further compare with Lyra [2] and FantasyWorld [14] in…
Figure 5: Qualitative comparison with Lyra and FantasyWorld. We show 3DGS renderings (Lyra and Ours) and point cloud renderings (FantasyWorld) in bird’s-eye view. Red bounding boxes highlight approximately the same spatial region across methods. Our interactive exploration framework produces scenes of significantly greater scale and complexity…
Figure 6: Qualitative ablation study. Given a single input image, we compare generations from our full model and ablated variants on Tanks and Temples scenes.
Figure 7: Applications. Our interactive interface allows users to specify camera trajectories within the 3D cache to easily generate novel viewpoints. Moreover, the reconstructed 3DGS scenes can be converted into surface meshes and integrated into embodied AI simulators such as NVIDIA Isaac Sim for robot simulation.
Figure 8: In-the-Wild Scene Generation. We show video generations and 3DGS reconstructions for challenging in-the-wild input images that go beyond the training data distribution. Our approach supports flexible camera trajectories specified in the GUI for world exploration, including combining multiple trajectories from the same starting point (see second example).
Figure 9: Target-frame coverage vs. number of retrieved spatial memory frames. We evaluate on training…
read the original abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
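Read literally, the abstract's drift remedy amounts to sometimes conditioning a training step on the model's own degraded outputs rather than on clean ground-truth history. A minimal sketch of one such step, assuming a hypothetical `generate_history` helper and a simple reconstruction loss standing in for the actual diffusion objective:

```python
# Sketch of training on self-augmented histories as the abstract describes it
# (a rough reading, not the authors' released training loop).
import torch

def self_augmented_step(model, batch, p_self=0.5):
    """batch: dict with 'history' (B,T,C,H,W), 'target' (B,C,H,W), and camera conditioning."""
    history, target, camera = batch["history"], batch["target"], batch["camera"]
    if torch.rand(()) < p_self:
        with torch.no_grad():
            # Replace the clean history with the model's own regenerated frames so the
            # training context matches the degraded context seen at inference time.
            history = model.generate_history(history[:, :1], camera)  # hypothetical helper
    pred = model(history, camera)        # predict the next frame/segment from (possibly degraded) context
    loss = torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    return loss.item()
```

Training always targets the clean ground truth, so the model is pushed to correct degraded context rather than propagate it.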

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Lyra 2.0, a framework for generating persistent, explorable 3D worlds via camera-controlled video trajectories. It addresses spatial forgetting by maintaining per-frame 3D geometry used exclusively for routing relevant past frames and establishing dense correspondences, while delegating appearance synthesis to the generative prior; temporal drifting is mitigated by training on self-augmented histories that expose the model to its own degraded outputs. These longer 3D-consistent trajectories are then used to fine-tune feed-forward reconstruction models for high-quality 3D scene recovery.

Significance. If the routing and self-augmentation techniques prove effective, the work could meaningfully advance scalable generative 3D scene synthesis by bridging video diffusion models with reliable 3D reconstruction, enabling consistent long-horizon exploration of large environments without rapid degradation.

major comments (2)
  1. [Abstract] Abstract (method description): The claim that maintaining per-frame 3D geometry 'solely for information routing' addresses spatial forgetting is load-bearing on the assumption that depth/pose estimates derived from prior generative outputs remain sufficiently accurate for correct frame retrieval and correspondence establishment. No mechanism, bound, or correction for potential noise/drift in these estimates is described, which risks incorrect routing and subsequent hallucinations on revisits.
  2. [Abstract] Abstract: The manuscript states that the proposed techniques 'enable substantially longer and 3D-consistent video trajectories' and 'reliably recover high-quality 3D scenes,' yet provides no quantitative results, ablation studies, implementation details, or baseline comparisons to support these outcomes. This absence prevents verification of whether the methods actually mitigate the identified degradations.
minor comments (1)
  1. [Abstract] The abstract introduces terms such as 'self-augmented histories' without a concise definition or high-level training procedure, which could be clarified for readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. Below we provide detailed responses to each major comment and outline the revisions we have made to address them.

read point-by-point responses
  1. Referee: [Abstract] Abstract (method description): The claim that maintaining per-frame 3D geometry 'solely for information routing' addresses spatial forgetting is load-bearing on the assumption that depth/pose estimates derived from prior generative outputs remain sufficiently accurate for correct frame retrieval and correspondence establishment. No mechanism, bound, or correction for potential noise/drift in these estimates is described, which risks incorrect routing and subsequent hallucinations on revisits.

    Authors: We agree that the assumption regarding the accuracy of depth and pose estimates is critical and not fully elaborated in the abstract. The manuscript describes using these estimates exclusively for routing relevant frames and establishing correspondences. To address the potential for noise and drift, we have added a new subsection in the methods detailing the use of robust matching techniques, such as RANSAC-based filtering for correspondences and a confidence threshold for frame retrieval. Additionally, we have included an analysis of how the self-augmented training helps mitigate the effects of imperfect routing by exposing the model to varied histories. We believe this clarifies the mechanism and reduces the risk of hallucinations (see the sketch after these responses). revision: yes

  2. Referee: [Abstract] Abstract: The manuscript states that the proposed techniques 'enable substantially longer and 3D-consistent video trajectories' and 'reliably recover high-quality 3D scenes,' yet provides no quantitative results, ablation studies, implementation details, or baseline comparisons to support these outcomes. This absence prevents verification of whether the methods actually mitigate the identified degradations.

    Authors: We agree that supporting quantitative evidence, ablations, and comparisons are essential to validate the claims. We have revised the manuscript to incorporate a comprehensive experiments section featuring these elements, including metrics demonstrating longer consistent trajectories and improved 3D scene recovery, along with implementation details. revision: yes
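One way the robust-matching measures described in the first response could look in practice. The RANSAC formulation via OpenCV, the thresholds, and the inlier-ratio gate are assumptions for illustration, not the authors' code.

```python
# Sketch of RANSAC-filtered correspondences plus a confidence gate on retrieved
# frames (illustrative values; not the paper's implementation).
import cv2
import numpy as np

def filter_correspondences(pts_hist, pts_target, reproj_thresh=1.0):
    """pts_hist, pts_target: Nx2 float32 arrays of putative matches (N >= 8).
    Keep only matches consistent with a single epipolar geometry."""
    F, mask = cv2.findFundamentalMat(pts_hist, pts_target, cv2.FM_RANSAC,
                                     reproj_thresh, 0.999)
    if mask is None:
        return np.zeros(len(pts_hist), dtype=bool)
    return mask.ravel().astype(bool)

def accept_retrieved_frame(pts_hist, pts_target, min_inlier_ratio=0.3):
    """Drop a retrieved history frame whose matches are mostly outliers."""
    inliers = filter_correspondences(pts_hist, pts_target)
    return inliers.mean() >= min_inlier_ratio if len(inliers) else False
```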

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper describes a framework that maintains per-frame 3D geometry solely for routing past frames and correspondences while delegating appearance synthesis to the generative prior, and trains with self-augmented histories to correct drift. These are presented as methodological innovations that build on external video generation models and feed-forward reconstruction techniques. No equations, fitted parameters, or self-citations are shown to reduce any central claim to its own inputs by construction. The derivation chain remains self-contained, relying on stated design choices rather than self-definitional loops or renamed predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from video generation and 3D reconstruction literature with no new free parameters or invented entities introduced in the abstract.

axioms (2)
  • domain assumption: Video generation models can be guided by external 3D geometry for frame retrieval and correspondence without altering their appearance synthesis prior.
    Invoked in the information routing mechanism to address spatial forgetting.
  • domain assumption: Training on self-generated degraded histories enables the model to correct rather than propagate temporal drift.
    Core premise for the self-augmented training strategy.

pith-pipeline@v0.9.0 · 5631 in / 1410 out tokens · 68534 ms · 2026-05-10T15:26:18.811138+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

cs.CV · 2026-05 · unverdicted · novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

Reference graph

Works this paper leans on

141 extracted references · 60 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1] S. Bahmani, J. J. Park, D. Paschalidou, X. Yan, G. Wetzstein, L. Guibas, and A. Tagliasacchi. CC3D: Layout-conditioned generation of compositional 3D scenes. In Proc. ICCV, 2023.
  2. [2] S. Bahmani, T. Shen, J. Ren, J. Huang, Y. Jiang, H. Turki, A. Tagliasacchi, D. B. Lindell, Z. Gojcic, S. Fidler, H. Ling, J. Gao, and X. Ren. Lyra: Generative 3D scene reconstruction via video diffusion model self-distillation. In ICLR.
  3. [3] S. Bahmani, I. Skorokhodov, G. Qian, A. Siarohin, W. Menapace, A. Tagliasacchi, D. B. Lindell, and S. Tulyakov. AC3D: Analyzing and improving 3D camera control in video diffusion transformers. Proc. CVPR, 2025.
  4. [4] S. Bahmani, I. Skorokhodov, A. Siarohin, W. Menapace, G. Qian, M. Vasilkovsky, H.-Y. Lee, C. Wang, J. Zou, A. Tagliasacchi, et al. VD3D: Taming large video diffusion transformers for 3D camera control. Proc. ICLR, 2025.
  5. [5] P. J. Ball, J. Bauer, F. Belletti, B. Brownfield, A. Ephrat, S. Fruchter, A. Gupta, K. Holsheimer, A. Holynski, J. Hron, et al. Genie 3: A new frontier for world models. Google DeepMind Blog, pages 253–279, 2025.
  6. [6] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators. OpenAI technical reports, 2024.
  7. [7] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis, et al. Efficient geometry-aware 3D generative adversarial networks. In Proc. CVPR, 2022.
  8. [8] E. R. Chan, K. Nagano, M. A. Chan, A. W. Bergman, J. J. Park, A. Levy, M. Aittala, S. De Mello, T. Karras, and G. Wetzstein. Generative novel view synthesis with 3D-aware diffusion models. In Proc. ICCV, 2023.
  9. [9] D. Charatan, S. L. Li, A. Tagliasacchi, and V. Sitzmann. pixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In Proc. CVPR, 2024.
  10. [10] E. M. Chen, S. Holalkere, R. Yan, K. Zhang, and A. Davis. Ray conditioning: Trading photo-consistency for photo-realism in multi-view image generation. In ICCV, 2023.
  11. [11] K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese. Text2Shape: Generating shapes from natural language by learning joint embeddings. In Proc. ACCV, 2018.
  12. [12] R. Chen, Y. Chen, N. Jiao, and K. Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. arXiv preprint arXiv:2303.13873, 2023.
  13. [13] T. Cosmos. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.
  14. [14] Y. Dai, F. Jiang, C. Wang, M. Xu, and Y. Qi. FantasyWorld: Geometry-consistent world modeling via unified video and 3D prediction. In ICLR, 2026.
  15. [15] K. Dalal, D. Koceja, J. Xu, Y. Zhao, S. Han, K. C. Cheung, J. Kautz, Y. Choi, Y. Sun, and X. Wang. One-minute video generation with test-time training. In CVPR, 2025.
  16. [16] I. Deutsch, N. Moënne-Loccoz, G. State, and Z. Gojcic. PPISP: Physically-plausible compensation and control of photometric variations in radiance field reconstruction. arXiv preprint arXiv:2601.18336, 2026.
  17. [17] T. DeVries, M. A. Bautista, N. Srivastava, G. W. Taylor, and J. M. Susskind. Unconstrained scene generation with locally conditioned radiance fields. In Proc. ICCV, 2021.
  18. [18] H. Duan, H.-X. Yu, S. Chen, L. Fei-Fei, and J. Wu. WorldScore: A unified evaluation benchmark for world generation. In ICCV, 2025.
  19. [19] Q. Feng, Z. Xing, Z. Wu, and Y.-G. Jiang. FDGaussian: Fast Gaussian splatting from single image via geometric-aware diffusion model. arXiv preprint arXiv:2403.10242, 2024.
  20. [20] J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler. GET3D: A generative model of high quality 3D textured shapes learned from images. In Proc. NeurIPS, 2022.
  21. [21] R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole. CAT3D: Create anything in 3D with multi-view diffusion models. In Proc. NeurIPS, 2024.
  22. [22] W. Gao, N. Aigerman, T. Groueix, V. Kim, and R. Hanocka. TextDeformer: Geometry manipulation using text guidance. In SIGGRAPH, 2023.
  23. [23] J. Gu, A. Trevithick, K.-E. Lin, J. M. Susskind, C. Theobalt, L. Liu, and R. Ramamoorthi. NerfDiff: Single-image view synthesis with NeRF-guided distillation from 3D-aware diffusion. In Proc. ICML, 2023.
  24. [24] Y. Gu, W. Mao, and M. Z. Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025.
  25. [25] J. Han, F. Kokkinos, and P. Torr. VFusion3D: Learning scalable 3D generative models from video diffusion models. arXiv preprint arXiv:2403.12034, 2024.
  26. [26] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
  27. [27] X. He, J. Chen, S. Peng, D. Huang, Y. Li, X. Huang, C. Yuan, W. Ouyang, and T. He. GVGEN: Text-to-3D generation with volumetric representation. arXiv preprint arXiv:2403.12957, 2024.
  28. [28] X. He, C. Peng, Z. Liu, B. Wang, Y. Zhang, Q. Cui, F. Kang, B. Jiang, M. An, Y. Ren, et al. Matrix-Game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.
  29. [29] L. Höllein, A. Božič, N. Müller, D. Novotny, H.-Y. Tseng, C. Richardt, M. Zollhöfer, and M. Nießner. ViewDiff: 3D-consistent image generation with text-to-image models. In Proc. CVPR, 2024.
  30. [30] L. Höllein, A. Cao, A. Owens, J. Johnson, and M. Nießner. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In Proc. ICCV, 2023.
  31. [31] L. Höllein and M. Nießner. World reconstruction from inconsistent views. arXiv preprint arXiv:2603.16736, 2026.
  32. [32] Y. Hong, B. Liu, M. Wu, Y. Zhai, K.-W. Chang, L. Li, K. Lin, C.-C. Lin, J. Wang, Z. Yang, et al. SlowFast-VGen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277, 2024.
  33. [33] Y. Hong, Y. Mei, C. Ge, Y. Xu, Y. Zhou, S. Bi, Y. Hold-Geoffroy, M. Roberts, M. Fisher, E. Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025.
  34. [34] Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan. LRM: Large reconstruction model for single image to 3D. In Proc. ICLR, 2024.
  35. [35] J. Huang, Q. Zhou, H. Rabeti, A. Korovko, H. Ling, X. Ren, T. Shen, J. Gao, D. Slepichev, C.-H. Lin, et al. ViPE: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.
  36. [36] X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
  37. [37] T. HunyuanWorld. Hy-World 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency. arXiv preprint, 2025.
  38. [38] A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole. Zero-shot text-guided object generation with dream fields. In Proc. CVPR, 2022.
  39. [39] N. Jetchev. ClipMatrix: Text-controlled creation of 3D textured meshes. arXiv preprint arXiv:2109.12922, 2021.
  40. [40] L. Jiang and L. Wang. BrightDreamer: Generic 3D Gaussian generative framework for fast text-to-3D synthesis. arXiv preprint arXiv:2403.11273, 2024.
  41. [41] Y. Kant, E. Weber, J. K. Kim, R. Khirodkar, S. Zhaoen, J. Martinez, I. Gilitschenski, S. Saito, and T. Bagautdinov. Pippo: High-resolution multi-view humans from a single image. In Proc. CVPR, 2025.
  42. [42] O. Katzir, O. Patashnik, D. Cohen-Or, and D. Lischinski. Noise-free score distillation. In Proc. ICLR, 2024.
  43. [43] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3D Gaussian splatting for real-time radiance field rendering. In ACM TOG, 2023.
  44. [44] S. W. Kim, B. Brown, K. Yin, K. Kreis, K. Schwarz, D. Li, R. Rombach, A. Torralba, and S. Fidler. NeuralField-LDM: Scene generation with hierarchical latent diffusion models. In Proc. CVPR, 2023.
  45. [45] T. Kirschstein, J. Romero, A. Sevastopolsky, M. Nießner, and S. Saito. Avat3r: Large animatable Gaussian reconstruction model for high-fidelity 3D head avatars. arXiv preprint arXiv:2502.20220, 2025.
  46. [46] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM TOG, 36(4), 2017.
  47. [47] K. Lee, K. Sohn, and J. Shin. DreamFlow: High-quality text-to-3D generation by approximating probability flow. In Proc. ICLR, 2024.
  48. [48] G. Li, S. Zheng, S. Xu, J. Chen, B. Li, X. Hu, L. Zhao, and P.-T. Jiang. MagicWorld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025.
  49. [49] J. Li, H. Tan, K. Zhang, Z. Xu, F. Luan, Y. Xu, Y. Hong, K. Sunkavalli, G. Shakhnarovich, and S. Bi. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. In Proc. ICLR, 2024.
  50. [50] J. Li, J. Tang, Z. Xu, L. Wu, Y. Zhou, S. Shao, T. Yu, Z. Cao, and Q. Lu. Hunyuan-GameCraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025.
  51. [51] R. Li, P. Torr, A. Vedaldi, and T. Jakab. VMem: Consistent interactive video scene generation with surfel-indexed view memory. In ICCV, pages 25690–25699, 2025.
  52. [52] X. Li, T. Wang, Z. Gu, S. Zhang, C. Guo, and L. Cao. FlashWorld: High-quality 3D scene generation within seconds. In ICLR, 2026.
  53. [53] Z. Li, Y. Chen, L. Zhao, and P. Liu. Controllable text-to-3D generation via surface-aligned Gaussian splatting. arXiv preprint arXiv:2403.09981, 2024.
  54. [54] H. Liang, J. Cao, V. Goel, G. Qian, S. Korolev, D. Terzopoulos, K. N. Plataniotis, S. Tulyakov, and J. Ren. Wonderland: Navigating 3D scenes from a single image. Proc. CVPR, 2025.
  55. [55] H. Liang, J. Ren, A. Mirzaei, A. Torralba, Z. Liu, I. Gilitschenski, S. Fidler, C. Oztireli, H. Ling, Z. Gojcic, and J. Huang. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. Proc. NeurIPS, 2025.
  56. [56] Y. Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y. Chen. LucidDreamer: Towards high-fidelity text-to-3D generation via interval score matching. arXiv preprint arXiv:2311.11284, 2023.
  57. [57] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proc. CVPR, 2023.
  58. [58] H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.
  59. [59] Y. Lin, H. Han, C. Gong, Z. Xu, Y. Zhang, and X. Li. Consistent123: One image to highly consistent 3D asset using case-aware diffusion priors. arXiv preprint arXiv:2309.17261, 2023.
  60. [60] L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu, et al. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In CVPR, pages 22160–22169, 2024.
  61. [61] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  62. [62] P. Liu, Z. Guo, M. Warke, S. Chintala, C. Paxton, N. M. M. Shafiullah, and L. Pinto. DynaMem: Online dynamic spatio-semantic memory for open world mobile manipulation. In ICRA, 2025.
  63. [63] P. Liu, Y. Wang, F. Sun, J. Li, H. Xiao, H. Xue, and X. Wang. Isotropic3D: Image-to-3D generation based on a single CLIP embedding. arXiv preprint arXiv:2403.10395, 2024.
  64. [64] R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proc. ICCV, 2023.
  65. [65] X. Liu, X. Zhan, J. Tang, Y. Shan, G. Zeng, D. Lin, X. Liu, and Z. Liu. HumanGaussian: Text-driven 3D human generation with Gaussian splatting. In Proc. CVPR, 2024.
  66. [66] Y. Liu, C. Lin, Z. Zeng, X. Long, L. Liu, T. Komura, and W. Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. In Proc. ICLR, 2024.
  67. [67] X. Long, Y.-C. Guo, C. Lin, Y. Liu, Z. Dou, L. Liu, Y. Ma, S.-H. Zhang, M. Habermann, C. Theobalt, et al. Wonder3D: Single image to 3D using cross-domain diffusion. In Proc. CVPR, 2024.
  68. [68] I. Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  69. [69] Y. Lu, X. Ren, J. Yang, T. Shen, Z. Wu, J. Gao, Y. Wang, S. Chen, M. Chen, S. Fidler, et al. InfiniCube: Unbounded and controllable dynamic 3D driving scene generation with world-guided video models. arXiv preprint arXiv:2412.03934.
  70. [70] X. Mao, Z. Li, C. Li, X. Xu, K. Ying, T. He, J. Pang, Y. Qiao, and K. Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096, 2025.
  71. [71] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proc. ECCV, 2020.
  72. [72] K. Museth. VDB: High-resolution sparse volumes with dynamic topology. ACM Transactions on Graphics (TOG), 32(3):1–22, 2013.
  73. [73] R. Or-El, X. Luo, M. Shan, E. Shechtman, J. J. Park, and I. Kemelmacher-Shlizerman. StyleSDF: High-resolution 3D-consistent image and geometry generation. In Proc. CVPR, 2022.
  74. [74] R. Po, E. R. Chan, C. Chen, and G. Wetzstein. Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080, 2025.
  75. [75] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In Proc. ICLR, 2023.
  76. [76] G. Qian, J. Cao, A. Siarohin, Y. Kant, C. Wang, M. Vasilkovsky, H.-Y. Lee, Y. Fang, I. Skorokhodov, P. Zhuang, et al. AToM: Amortized text-to-mesh using 2D diffusion. arXiv preprint arXiv:2402.00867, 2024.
  77. [77] G. Qian, J. Mai, A. Hamdi, J. Ren, A. Siarohin, B. Li, H.-Y. Lee, I. Skorokhodov, P. Wonka, S. Tulyakov, et al. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. In Proc. ICLR, 2024.
  78. [78] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In Proc. ICML, 2021.
  79. [79] X. Ren, J. Huang, X. Zeng, K. Museth, S. Fidler, and F. Williams. XCube: Large-scale 3D generative modeling using sparse voxel hierarchies. In Proc. CVPR, 2024.
  80. [80] X. Ren, Y. Lu, H. Liang, Z. Wu, H. Ling, M. Chen, S. Fidler, F. Williams, and J. Huang. SCube: Instant large-scale scene reconstruction using VoxSplats. Proc. NeurIPS, 2024.

Showing first 80 references.