NavCrafter: Exploring 3D Scenes from a Single Image
Pith reviewed 2026-05-13 20:12 UTC · model grok-4.3
The pith
A video diffusion model steered by camera paths generates consistent novel views from one image to build explorable 3D scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that video diffusion models already encode rich 3D priors. When those priors are conditioned through a multi-stage camera control mechanism (dual-branch camera injection and attention modulation) together with a collision-aware trajectory planner, the model produces temporally and spatially consistent novel-view videos; these videos then feed an enhanced 3D Gaussian splatting pipeline with depth-aligned supervision and structural regularization, raising reconstruction fidelity under large viewpoint shifts.
What carries the argument
multi-stage camera control mechanism that conditions video diffusion models on diverse trajectories via dual-branch camera injection and attention modulation
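Neither the abstract nor this summary fixes the exact architecture, so the sketch below is only one plausible reading of "dual-branch camera injection and attention modulation": one branch adds a per-pixel camera-ray embedding to the latent tokens, the other maps the pose to scale/shift terms that modulate the attention queries. All class names, dimensions, and fusion points are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of dual-branch camera conditioning for one attention block.
# Branch A injects a per-pixel Plücker-style ray embedding into the latent tokens;
# branch B turns the camera extrinsics into scale/shift terms that modulate queries.
import torch
import torch.nn as nn

class DualBranchCameraConditioning(nn.Module):
    def __init__(self, feat_dim: int, ray_dim: int = 6, pose_dim: int = 12):
        super().__init__()
        # Branch A: encode per-pixel camera rays (origin + direction, 6 channels).
        self.ray_mlp = nn.Sequential(
            nn.Linear(ray_dim, feat_dim), nn.SiLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Branch B: encode the flattened 3x4 extrinsics into attention scale/shift.
        self.pose_mlp = nn.Sequential(
            nn.Linear(pose_dim, feat_dim), nn.SiLU(), nn.Linear(feat_dim, 2 * feat_dim)
        )
        # feat_dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, tokens, rays, pose):
        # tokens: (B, N, C) latent tokens for one frame
        # rays:   (B, N, 6) per-token camera ray
        # pose:   (B, 12)   flattened camera-to-world matrix
        tokens = tokens + self.ray_mlp(rays)                 # branch A: additive injection
        scale, shift = self.pose_mlp(pose).chunk(2, dim=-1)
        q = tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # branch B: query modulation
        out, _ = self.attn(q, tokens, tokens)
        return tokens + out
```

This mirrors a common pattern in camera-controlled video diffusion work (feature-level ray maps plus pose-driven modulation), but where NavCrafter places these branches inside the diffusion backbone is not specified in this summary.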
If this is right
- Novel views remain consistent even when the camera moves far from the original viewpoint.
- 3D reconstruction quality rises because the generated sequences supply aligned depth and structure for Gaussian splatting.
- Camera paths can be planned in advance to cover more of the scene without collisions (a toy feasibility check is sketched after this list).
- Single-image inputs suffice for full scene navigation and exploration.
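The collision-free planning bullet can be made concrete with a toy feasibility check. Assuming the planner has access to a monocular depth estimate of the input view (the abstract's geometry-aware expansion suggests something of this kind), one simple test back-projects that depth into a point cloud and rejects any candidate camera path whose waypoints come within a clearance margin of it. The function names and threshold are illustrative; the paper's actual planner is not described here.

```python
# Minimal sketch of a collision-aware trajectory check (not the paper's planner).
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T          # normalized image-plane rays (z = 1)
    return rays * depth.reshape(-1, 1)

def is_collision_free(waypoints, points, clearance=0.2):
    """True if every camera waypoint stays at least `clearance` from all scene points."""
    for p in waypoints:                      # waypoints: (T, 3) camera centers
        if np.min(np.linalg.norm(points - p, axis=1)) < clearance:
            return False
    return True

# Usage (illustrative): sample candidate orbits/zooms and keep the collision-free ones.
# K = np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1]], dtype=float)
# points = backproject(depth_map, K)
# safe = [traj for traj in candidate_trajectories if is_collision_free(traj, points)]
```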
Where Pith is reading between the lines
- The same conditioning idea could be tested on scenes containing moving objects to check whether temporal consistency survives.
- The output videos might serve as training data for other single-image 3D tasks such as depth completion.
- If diffusion inference can be sped up, the method could support interactive 3D preview from a phone photo.
- Limits may appear when the input image contains fine details or unusual lighting not well represented in the diffusion training data.
Load-bearing premise
Video diffusion models contain sufficient built-in 3D structure that proper camera conditioning will produce geometrically consistent multi-view output.
What would settle it
The claim would fail if synthesized frames under large viewpoint changes showed object positions or depths that the Gaussian splatting step cannot align into one coherent 3D model.
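One way to operationalize this settling condition, independent of the paper, is a cross-view depth-reprojection check on the generated frames: warp the depth of one synthesized view into another using the commanded relative pose and measure the disagreement. Persistently large errors on wide-baseline pairs would indicate geometry that no single splatting model can fit. The sketch assumes per-frame depths and known intrinsics, which the method does not guarantee.

```python
# Illustrative cross-view consistency check for generated frames (not from the paper).
import numpy as np

def reprojection_depth_error(depth_a, K, T_ab, depth_b):
    """Mean |projected depth - observed depth| for pixels of frame A visible in frame B.

    depth_a, depth_b: (H, W) depth maps of two generated frames
    K: (3, 3) intrinsics; T_ab: (4, 4) relative pose from camera A to camera B.
    """
    h, w = depth_a.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    pts_a = (pix @ np.linalg.inv(K).T) * depth_a.reshape(-1, 1)   # 3D points in camera A
    pts_b = pts_a @ T_ab[:3, :3].T + T_ab[:3, 3]                  # same points in camera B
    proj = pts_b @ K.T
    zb = proj[:, 2]
    ub, vb = proj[:, 0] / zb, proj[:, 1] / zb
    valid = (zb > 0) & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    observed = depth_b[vb[valid].astype(int), ub[valid].astype(int)]
    return float(np.mean(np.abs(zb[valid] - observed)))
```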
Original abstract
Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. NavCrafter introduces a pipeline that conditions video diffusion models via dual-branch camera injection and attention modulation, combined with a collision-aware trajectory planner and an enhanced depth-aligned 3D Gaussian Splatting reconstruction stage, to generate temporally and spatially consistent novel-view video sequences from a single input image while improving 3D scene fidelity.
Significance. If the experimental results hold, the work would provide a practical engineering route to controllable large-baseline novel-view synthesis and reconstruction without requiring multi-view or 3D training data, leveraging off-the-shelf video diffusion priors.
Major comments (2)
- [Abstract] The central claim of achieving state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improved 3D reconstruction fidelity is unsupported by any quantitative metrics, baseline comparisons, error analysis, or tables; this gap is load-bearing because the abstract presents these outcomes as demonstrated results.
- [Method] The claim that dual-branch camera injection and attention modulation suffice to unlock usable 3D priors from video diffusion models for consistent multi-view synthesis rests on an untested assumption; no ablation isolating the contribution of each conditioning component is described, and no failure cases under large viewpoint changes are analyzed.
Minor comments (1)
- [Method] The geometry-aware expansion strategy and structural regularization terms in the 3DGS pipeline are described at a high level; adding pseudocode or explicit loss formulations would improve reproducibility.
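As an illustration of what such an explicit formulation could look like (not the paper's actual loss), a common "depth-aligned" supervision term fits a per-image scale and shift between a monocular depth prior and the depth rendered from the Gaussians, then penalizes the residual:

```python
# Hedged sketch of a depth-aligned supervision term for 3DGS training.
# One common formulation, assumed here for illustration; not NavCrafter's exact loss.
import torch

def depth_aligned_loss(rendered_depth, prior_depth, mask):
    """rendered_depth, prior_depth, mask: (H, W) tensors; mask marks valid pixels."""
    r = rendered_depth[mask.bool()]
    p = prior_depth[mask.bool()]
    # Closed-form per-image scale s and shift t minimizing ||s*p + t - r||^2.
    A = torch.stack([p, torch.ones_like(p)], dim=1)           # (N, 2)
    sol = torch.linalg.lstsq(A, r.unsqueeze(1)).solution       # (2, 1)
    s, t = sol[0, 0], sol[1, 0]
    return torch.mean(torch.abs(s * p + t - r))
```

A structural regularization term would act on the Gaussians themselves (for example, opacity or scale penalties in the sparse-view literature); its exact form in the paper is likewise not stated in this summary.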
Simulated Author's Rebuttal
We are grateful for the referee's thorough review and constructive suggestions. Below we provide detailed responses to the major comments and indicate the revisions we plan to implement.
Point-by-point responses
- Referee: [Abstract] The central claim of achieving state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improved 3D reconstruction fidelity is unsupported by any quantitative metrics, baseline comparisons, error analysis, or tables; this gap is load-bearing because the abstract presents these outcomes as demonstrated results.
Authors: We thank the referee for highlighting this issue. While the manuscript describes extensive experiments demonstrating these outcomes, we agree that the abstract would benefit from more direct support. In the revised version, we will update the abstract to include key quantitative results, such as specific metric improvements over baselines, and reference the relevant tables and sections for full baseline comparisons and error analysis. Revision: yes
- Referee: [Method] The claim that dual-branch camera injection and attention modulation suffice to unlock usable 3D priors from video diffusion models for consistent multi-view synthesis rests on an untested assumption; no ablation isolating the contribution of each conditioning component is described, and no failure cases under large viewpoint changes are analyzed.
Authors: We agree that an ablation study would provide stronger validation for the proposed conditioning mechanism. We will add a new subsection in the Experiments section with ablations that isolate the effects of dual-branch camera injection and attention modulation on multi-view consistency. We will also include analysis of failure cases for large viewpoint changes, such as increased artifacts or reduced temporal coherence, to better characterize the method's limitations. Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper describes NavCrafter as an engineering pipeline that combines pre-trained video diffusion models with custom camera conditioning, trajectory planning, and a depth-aligned 3DGS refinement stage. No equations, closed-form derivations, or parameter-fitting steps are present that could reduce a claimed prediction back to the input data by construction. All load-bearing components are justified by reference to external pre-trained models and standard 3DGS techniques rather than self-referential definitions or self-citation chains that carry the central result. The claims rest on reported experimental outcomes, which are independent of any internal algebraic reduction.
Reference graph
Works this paper leans on
- [1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
- [2] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3D Gaussian splatting for real-time radiance field rendering," ACM Transactions on Graphics, vol. 42, no. 4, pp. 139:1–139:14, 2023.
- [3] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
- [4] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," arXiv preprint arXiv:2011.13456, 2020.
- [5] J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee, "LucidDreamer: Domain-free generation of 3D Gaussian splatting scenes," arXiv preprint arXiv:2311.13384, 2023.
- [6] H.-X. Yu, H. Duan, C. Herrmann, W. T. Freeman, and J. Wu, "WonderWorld: Interactive 3D scene generation from a single image," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5916–5926.
- [7] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., "Stable Video Diffusion: Scaling latent video diffusion models to large datasets," arXiv preprint arXiv:2311.15127, 2023.
- [8] Z. Wang, Z. Yuan, X. Wang, Y. Li, T. Chen, M. Xia, P. Luo, and Y. Shan, "MotionCtrl: A unified and flexible motion controller for video generation," in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
- [9] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang, "CameraCtrl: Enabling camera control for video diffusion models," in The Thirteenth International Conference on Learning Representations, 2025.
- [10] X. Ren, T. Shen, J. Huang, H. Ling, Y. Lu, M. Nimier-David, T. Müller, A. Keller, S. Fidler, and J. Gao, "GEN3C: 3D-informed world-consistent video generation with precise camera control," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6121–6132.
- [11] Y. Chen, C. Yang, J. Fang, X. Zhang, L. Xie, W. Shen, W. Dai, H. Xiong, and Q. Tian, "LiftImage3D: Lifting any single image to 3D Gaussians with video generation priors," arXiv preprint arXiv:2412.09597, 2024.
- [12] H. Park, G. Ryu, and W. Kim, "DropGaussian: Structural regularization for sparse-view Gaussian splatting," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21600–21609.
- [13] J. Li, J. Zhang, X. Bai, J. Zheng, X. Ning, J. Zhou, and L. Gu, "DNGaussian: Optimizing sparse-view 3D Gaussian radiance fields with global-local depth normalization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20775–20785.
- [14] Y. Chen, C. Zheng, H. Xu, B. Zhuang, A. Vedaldi, T.-J. Cham, and J. Cai, "MVSplat360: Feed-forward 360 scene synthesis from sparse views," Advances in Neural Information Processing Systems, vol. 37, pp. 107064–107086, 2024.
- [15] B. Ma, H. Gao, H. Deng, Z. Luo, T. Huang, L. Tang, and X. Wang, "You see it, you got it: Learning 3D creation on pose-free videos at scale," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2016–2029.
- [16] R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole, "CAT3D: Create anything in 3D with multi-view diffusion models," arXiv preprint arXiv:2405.10314, 2024.
- [17] W. Yu, J. Xing, L. Yuan, W. Hu, X. Li, Z. Huang, X. Gao, T.-T. Wong, Y. Shan, and Y. Tian, "ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis," arXiv preprint arXiv:2409.02048, 2024.
- [18] F. Xiao, X. Liu, X. Wang, S. Peng, M. Xia, X. Shi, Z. Yuan, P. Wan, D. Zhang, and D. Lin, "3DTrajMaster: Mastering 3D trajectory for multi-entity motion in video generation," in The Thirteenth International Conference on Learning Representations, 2025.
- [19] T. Hu, J. Zhang, R. Yi, Y. Wang, H. Huang, J. Weng, Y. Wang, and L. Ma, "MotionMaster: Training-free camera motion transfer for video generation," arXiv preprint arXiv:2404.15789, 2024.
- [20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022.
- [21] W. Sun, S. Chen, F. Liu, Z. Chen, Y. Duan, J. Zhang, and Y. Wang, "DimensionX: Create any 3D and 4D scenes from a single image with controllable video diffusion," in International Conference on Computer Vision (ICCV), 2025.
- [22] H. Liang, J. Cao, V. Goel, G. Qian, S. Korolev, D. Terzopoulos, K. N. Plataniotis, S. Tulyakov, and J. Ren, "Wonderland: Navigating 3D scenes from a single image," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 798–810.
- [23] S. Zhai, Z. Ye, J. Liu, W. Xie, J. Hu, Z. Peng, H. Xue, D. Chen, X. Wang, L. Yang et al., "StarGen: A spatiotemporal autoregression framework with video diffusion model for scalable and controllable scene generation," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26822–26833.
- [24] H.-X. Yu, H. Duan, J. Hur, K. Sargent, M. Rubinstein, W. T. Freeman, F. Cole, D. Sun, N. Snavely, J. Wu et al., "WonderJourney: Going from anywhere to everywhere," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6658–6667.
- [25] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., "Wan: Open and advanced large-scale video generative models," arXiv preprint arXiv:2503.20314, 2025.
- [26] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, "VGGT: Visual geometry grounded transformer," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306.
- [27] R. Wang, S. Xu, Y. Dong, Y. Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang, "MoGe-2: Accurate monocular geometry with metric scale and sharp details," arXiv preprint arXiv:2507.02546, 2025.
- [28] J. Z. Wu, Y. Zhang, H. Turki, X. Ren, J. Gao, M. Z. Shou, S. Fidler, Z. Gojcic, and H. Ling, "Difix3D+: Improving 3D reconstructions with single-step diffusion models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26024–26035.
- [29] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, "Stereo magnification: Learning view synthesis using multiplane images," arXiv preprint arXiv:1805.09817, 2018.
- [30] L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu et al., "DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22160–22169.
- [31] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, "DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps," Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
- [32] S. Zhang, J. Li, X. Fei, H. Liu, and Y. Duan, "Scene Splatter: Momentum 3D scene generation from single image with video diffusion model," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6089–6098.
- [33] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, "Tanks and Temples: Benchmarking large-scale scene reconstruction," ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–13, 2017.
- [34] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.