
arxiv: 2603.23488 · v2 · submitted 2026-03-24 · 💻 cs.CV

One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Pith reviewed 2026-05-15 00:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords novel view synthesis · monocular depth · unpaired training · masked losses · zero-shot · in-the-wild images · geometry-free inference

The pith

A single unpaired image plus monocular depth is enough to train novel view synthesis at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that novel view synthesis no longer needs paired multi-view images for supervision. Instead, a monocular depth estimator lifts each source photo into 3D space, a randomly sampled camera transformation creates a pseudo-target view, and masked losses ignore the disoccluded pixels so training can proceed on raw internet data. This setup trains on 30 million uncurated photos and produces a model that needs no depth or 3D structure at inference time. The resulting system beats prior methods in zero-shot tests on diverse scenes while running 600 times faster than the second-best baseline.

Core claim

OVIE is trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation.
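
The lift, transform, and project steps in the core claim can be sketched as follows. This is a minimal NumPy illustration under assumptions that are not stated in the source: a pinhole camera with known intrinsics `K`, a dense per-pixel depth map, nearest-pixel forward splatting, and the hypothetical function name `make_pseudo_target`. It is not the authors' implementation.

```python
import numpy as np

def make_pseudo_target(image, depth, K, T):
    """Warp `image` (H, W, 3) into a new view (illustrative sketch only).

    depth: (H, W) positive depths along the camera z-axis.
    K:     (3, 3) pinhole intrinsics (assumed known here).
    T:     (4, 4) rigid transform from the source to the target camera frame.
    Returns the warped image and a boolean mask of pixels that received a point.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)

    # Lift: back-project every pixel to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    # Transform: express the points in the target camera frame.
    pts = pts @ T[:3, :3].T + T[:3, 3]

    # Project: keep points in front of the camera that land inside the image.
    z = pts[:, 2]
    front = z > 1e-6
    proj = (K @ pts.T).T
    uu = np.full(z.shape, -1, dtype=int)
    vv = np.full(z.shape, -1, dtype=int)
    uu[front] = np.round(proj[front, 0] / z[front]).astype(int)
    vv[front] = np.round(proj[front, 1] / z[front]).astype(int)
    ok = front & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)

    # Z-buffer: for each target pixel keep only the nearest projected point.
    flat = vv[ok] * W + uu[ok]
    order = np.lexsort((z[ok], flat))        # by pixel, then nearest first
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape, dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    winners = order[first]

    target = np.zeros((H * W, 3), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    target[flat[winners]] = image.reshape(-1, 3)[ok][winners]
    mask[flat[winners]] = True
    return target.reshape(H, W, 3), mask.reshape(H, W)
```

Target pixels where `mask` is False received no source point; these are the disoccluded regions that a masked loss would skip.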

What carries the argument

A monocular depth estimator serves as a geometric scaffold: it lifts each single image into 3D, a sampled camera transformation is applied, and the result is reprojected into a pseudo-target view. Masked losses then apply only to valid pixels.
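
A loss restricted to valid pixels might look like the sketch below. It assumes a plain per-pixel L1 term and a boolean validity mask; the function name `masked_l1` is hypothetical, and the paper's actual geometric, perceptual, and textural terms are not reproduced here.

```python
import numpy as np

def masked_l1(pred, target, valid):
    """Mean absolute error over valid pixels only (illustrative sketch).

    pred, target: (H, W, C) arrays.
    valid:        (H, W) boolean mask; False marks pixels with no usable
                  pseudo-target (for example, disocclusions).
    """
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return 0.0  # no supervised pixels in this sample
    diff = np.abs(pred - target)[valid]  # shape (n_valid, C)
    return float(diff.mean())
```

Because invalid pixels are dropped before averaging, arbitrary values in the pseudo-target's holes cannot influence the loss.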

If this is right

  • Novel view synthesis models can now be trained on tens of millions of single-view internet photos instead of scarce curated multi-view sets.
  • At test time the model requires only the input image and produces new views without any depth or 3D representation.
  • Zero-shot performance on in-the-wild images exceeds that of earlier methods trained with explicit multi-view pairs.
  • Training becomes practical at the scale of large uncurated web collections, while inference runs 600 times faster than the second-best baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting-and-masking idea could be applied to train other geometry tasks such as surface normal estimation or object insertion using only single images.
  • Removing the paired-data requirement opens the door to training on video frames treated as independent views.
  • The speed gain suggests deployment on mobile devices for real-time view synthesis in AR or video editing.

Load-bearing premise

The monocular depth estimates are accurate enough that the pseudo-target views they produce contain usable training signals despite any depth errors.

What would settle it

Run OVIE and a paired-supervision baseline on the same held-out multi-view dataset of real-world scenes and measure whether OVIE's novel-view error is at least as low as the baseline.
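
One way to score such a comparison is with PSNR, which the paper reports among its metrics (Figure 5). The sketch below assumes images scaled to [0, 1] and uses hypothetical function names; it illustrates the bookkeeping only and is not the paper's evaluation protocol.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means lower error."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def at_least_as_good(model_preds, baseline_preds, targets):
    """Compare mean PSNR of two sets of predictions on the same held-out targets."""
    model = float(np.mean([psnr(p, t) for p, t in zip(model_preds, targets)]))
    base = float(np.mean([psnr(p, t) for p, t in zip(baseline_preds, targets)]))
    return model >= base, model, base
```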

Figures

Figures reproduced from arXiv: 2603.23488 by Adrien Ramanana Rahary, David Picard, Nicolas Dufour, Patrick Perez.

Figure 1: OVIE generates novel views from a single image
Figure 2: Method overview. Top: From web-sourced images I0, a frozen monocular depth estimator extracts per-image 3D point clouds P. We then sample camera transformations T0→1 ∈ SE(3) (rotation and translation), apply them to the point clouds, and reproject to generate pseudo-target views I∗1. Bottom: Our model fθ takes a source image I0 and, conditioned on a camera transformation T0→1, predicts the correspondi…
Figure 3: Qualitative comparison with state-of-the-art methods.
Figure 4: Metric scale understanding. The same 20 cm camera translation is applied to two scenes of different physical scales. The close-up banana (left, 50 cm away) undergoes a large apparent displacement, while the room-scale scene (right, 3 m away) shows a proportionally smaller shift consistent with metrically correct parallax.
Figure 5: Scaling with dataset size. PSNR and FID on RealEstate10K as a function of training set size. Both metrics improve consistently as data volume increases. SSIM and LPIPS curves, which follow the same trend, are reported in the Supplementary.
Figure 6: Quality vs. Inference tradeoff on DL3DV.
Figure 7: Scaling with dataset size – SSIM and LPIPS.
Figure 8: Quality vs. Inference tradeoff on DL3DV – SSIM and LPIPS.
Figure 9: Qualitative results on out-of-distribution images.
Figure 10: Comparison of source inputs, training pseudo-targets, and gener
Figure 11: Comparison of source inputs, training pseudo-targets, and gener
Figure 12: Comparison of source inputs, training pseudo-targets, and gener
Figure 13: Comparison of source inputs, training pseudo-targets, and gener
Figure 14: Qualitative comparison with state-of-the-art methods.
Figure 15: Qualitative comparison with state-of-the-art methods.
Figure 16: Qualitative comparison with state-of-the-art methods.
Figure 17: Qualitative comparison with InfiniteNature-Zero [
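
The metric-scale behaviour described in the Figure 4 caption follows from the pinhole model: for a camera translation parallel to the image plane, a point at depth Z shifts by roughly f · t / Z pixels. The sketch below works through the caption's numbers. The focal length is a hypothetical value, and the sideways direction of the translation is an assumption, since the caption does not specify it.

```python
def parallax_px(focal_px, translation_m, depth_m):
    """Image-plane shift in pixels of a point at `depth_m` when a pinhole
    camera translates sideways by `translation_m`."""
    return focal_px * translation_m / depth_m

focal = 1000.0                             # hypothetical focal length (pixels)
near = parallax_px(focal, 0.20, 0.50)      # banana, 50 cm away
far = parallax_px(focal, 0.20, 3.00)       # room-scale scene, 3 m away
ratio = near / far                         # depends only on the two depths
```

Whatever the focal length, the ratio of the two shifts is 3 m / 0.5 m = 6, so the close-up object should move about six times as far across the image as the room-scale scene.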
Original abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OVIE, a monocular novel-view synthesis model trained exclusively on unpaired in-the-wild internet images. It uses a pre-trained monocular depth estimator to lift source images into 3D, applies random rigid transforms, and projects to create pseudo-target views, with a masked formulation of geometric, perceptual, and textural losses restricted to valid (non-disoccluded) regions. At inference the model is geometry-free. The central claim is that this yields superior zero-shot performance on novel-view synthesis benchmarks while being 600x faster than the next-best baseline.

Significance. If the empirical claims hold, the work would be significant for scaling novel-view synthesis to diverse, uncurated data at internet scale without requiring multi-view supervision. The geometry-free inference and reported speed advantage could broaden practical applicability. The public release of code and models is a positive contribution for reproducibility.

major comments (2)
  1. [§3] §3 (Method), masked loss formulation: the claim that restricting losses to the valid mask M isolates the training signal from depth errors is load-bearing for the central argument, yet no ablation quantifies how depth inaccuracies (scale drift, boundary misalignment, or surface holes) propagate into the learned representation when errors are spatially correlated. A controlled experiment replacing the depth estimator with ground-truth or noisy variants would directly test this.
  2. [§4] §4 (Experiments), zero-shot evaluation: the reported outperformance and 600x speedup are central claims, but the manuscript provides insufficient detail on the exact baselines, training data scale for each comparator, and whether the monocular depth estimator is frozen or fine-tuned during OVIE training. Without these, it is difficult to attribute gains to the masked monocular training versus the strength of the depth prior.
minor comments (2)
  1. [Figure 2] Figure 2 and associated text: the visualization of the lifting-projection pipeline would benefit from explicit annotation of the mask M and the regions excluded from the loss.
  2. [Related Work] Related work section: the discussion of prior monocular depth-based methods could more explicitly contrast the proposed masked formulation against existing warping-based approaches.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and positive assessment of the significance of our work. We address each major comment below and have revised the manuscript to improve clarity and provide additional analysis where feasible.

point-by-point responses
  1. Referee: [§3] §3 (Method), masked loss formulation: the claim that restricting losses to the valid mask M isolates the training signal from depth errors is load-bearing for the central argument, yet no ablation quantifies how depth inaccuracies (scale drift, boundary misalignment, or surface holes) propagate into the learned representation when errors are spatially correlated. A controlled experiment replacing the depth estimator with ground-truth or noisy variants would directly test this.

    Authors: We agree that quantifying the effect of depth inaccuracies would strengthen the paper. Because our training uses unpaired in-the-wild images, ground-truth depth is unavailable and a full GT-depth ablation is not possible. We have added a new ablation in the revised manuscript that replaces the depth estimator with a noisy variant (introducing scale drift, boundary misalignment, and holes) and measures the resulting degradation in novel-view quality. The results show that the masked loss formulation limits error propagation by excluding disoccluded and invalid pixels, supporting the original claim. We have also expanded the discussion in §3 to describe this mechanism explicitly. revision: partial

  2. Referee: [§4] §4 (Experiments), zero-shot evaluation: the reported outperformance and 600x speedup are central claims, but the manuscript provides insufficient detail on the exact baselines, training data scale for each comparator, and whether the monocular depth estimator is frozen or fine-tuned during OVIE training. Without these, it is difficult to attribute gains to the masked monocular training versus the strength of the depth prior.

    Authors: We thank the referee for this request for clarification. In the revised manuscript we have expanded §4 with: (i) a table listing every baseline together with its original training data scale and supervision type, (ii) explicit confirmation that the monocular depth estimator is pre-trained and kept frozen throughout OVIE training, and (iii) an additional controlled experiment that isolates the contribution of the masked loss from the depth prior. These additions make clear that the reported gains arise from the monocular training procedure on 30 M unpaired images rather than from the depth estimator alone. revision: yes
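
The noisy-depth ablation mentioned in the first response could be set up along the lines below. This is only a sketch: the function name `perturb_depth`, the three perturbation models, and every parameter value are hypothetical, since the rebuttal does not say how the noise was generated.

```python
import numpy as np

def perturb_depth(depth, rng, scale_sigma=0.1, shift_px=2, hole_frac=0.05):
    """Return a degraded copy of `depth` (H, W) and a mask of surviving pixels.

    All three corruptions are crude stand-ins chosen for illustration.
    """
    # Scale drift: one global multiplicative error per image (log-normal).
    scale = float(np.exp(rng.normal(0.0, scale_sigma)))
    noisy = depth * scale

    # Boundary misalignment: shift the map sideways so depth edges no longer
    # line up with image edges (np.roll wraps around at the border).
    noisy = np.roll(noisy, shift_px, axis=1)

    # Holes: drop a random fraction of pixels and mark them as NaN.
    keep = rng.random(depth.shape) >= hole_frac
    noisy = np.where(keep, noisy, np.nan)
    return noisy, keep
```

Passing a seeded `np.random.default_rng(...)` keeps the corruption reproducible, so the same degraded depth maps can be reused across training runs.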

standing simulated objections not resolved
  • A controlled experiment that replaces the depth estimator with ground-truth depth on the in-the-wild unpaired training set, as no such ground-truth depth exists for internet-scale images.

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper's central training procedure lifts images using an external pre-trained monocular depth estimator (not fitted or defined within the work) to synthesize pseudo-target views via rigid transforms and projection, then applies masked geometric, perceptual, and textural losses only on valid regions. This objective is not equivalent to the inference-time novel-view output by construction, as the network must learn a generalizable mapping that operates without depth or 3D at test time. No load-bearing self-citations, self-definitional steps, or fitted parameters renamed as predictions appear in the provided text; the method is self-contained against external benchmarks and does not reduce the claimed result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of an external monocular depth estimator for creating training targets; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Monocular depth estimators provide sufficiently accurate 3D lifts to serve as geometric scaffolds for pseudo-view generation.
    Invoked to lift source images into 3D and project after camera transformation.


