One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Adrien Ramanana Rahary; David Picard; Nicolas Dufour; Patrick Perez

arxiv: 2603.23488 · v2 · submitted 2026-03-24 · 💻 cs.CV

Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard

Pith reviewed 2026-05-15 00:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: novel view synthesis, monocular depth, unpaired training, masked losses, zero-shot, in-the-wild images, geometry-free inference

The pith

A single unpaired image plus monocular depth is enough to train novel view synthesis at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that novel view synthesis no longer needs paired multi-view images for supervision. Instead, a monocular depth estimator lifts each source photo into 3D space, a randomly sampled camera transformation creates a pseudo-target view, and masked losses ignore the disoccluded pixels so training can proceed on raw internet data. This setup trains on 30 million uncurated photos and produces a model that needs no depth or 3D structure at inference time. The resulting system beats prior methods in zero-shot tests while running 600 times faster than the second-best baseline.

Core claim

OVIE is trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation.
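The lift, transform, and project steps described above can be sketched with a pinhole camera model. This is a minimal illustration, not the OVIE implementation: the function name `make_pseudo_target`, the shared intrinsics `K`, the nearest-pixel splatting, and the convention that `R, t` map source-frame points to target-frame points are all assumptions made for the example.

```python
import numpy as np

def make_pseudo_target(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a target camera.

    depth: (H, W) per-pixel depth along the optical axis of the source camera.
    K:     (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:  rotation (3, 3) and translation (3,) mapping source-frame points
           to target-frame points, X_target = R @ X_source + t.
    Returns (target, mask); mask is False where no source pixel landed.
    """
    H, W = depth.shape
    C = image.shape[-1]
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Lift: back-project every pixel to a 3D point using its depth.
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    # Transform: move the point cloud into the target camera frame.
    pts_t = pts @ R.T + t
    z = pts_t[:, 2]
    in_front = z > 1e-6

    # Project: perspective division, then snap to the nearest pixel.
    proj = (K @ pts_t.T).T
    safe_z = np.where(in_front, z, 1.0)
    u_t = np.round(proj[:, 0] / safe_z).astype(np.int64)
    v_t = np.round(proj[:, 1] / safe_z).astype(np.int64)
    keep = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Z-buffer: when several points hit one pixel, keep the nearest one.
    idx = np.flatnonzero(keep)
    flat = v_t[idx] * W + u_t[idx]
    order = np.lexsort((z[idx], flat))  # sort by target pixel, then by depth
    uniq, first = np.unique(flat[order], return_index=True)
    winners = idx[order[first]]

    target = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    target[uniq] = image.reshape(-1, C)[winners]
    mask[uniq] = True
    return target.reshape(H, W, C), mask.reshape(H, W)
```

The returned `mask` marks pixels that received a value. Its complement covers the disoccluded or out-of-frame regions that a masked loss would skip.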

What carries the argument

Monocular depth estimator serving as geometric scaffold to lift single images into 3D, followed by camera transformation and masked reprojection losses that apply only to valid pixels.
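A masked loss of this kind can be sketched as a per-pixel penalty averaged only over valid pixels. The example below uses a Charbonnier-style penalty purely as a stand-in for a pixel-level term; the function name, the `eps` value, and the choice of penalty are assumptions, and the paper's actual loss terms and weights are not reproduced here.

```python
import numpy as np

def masked_charbonnier(pred, target, mask, eps=1e-3):
    """Mean Charbonnier penalty over pixels where `mask` is True.

    pred, target: (H, W, C) float arrays.
    mask:         (H, W) boolean array, True for valid (non-disoccluded) pixels.
    """
    diff = pred - target
    per_pixel = np.sqrt(diff ** 2 + eps ** 2).mean(axis=-1)  # shape (H, W)
    m = mask.astype(per_pixel.dtype)
    # Normalise by the number of valid pixels; guard against an empty mask.
    return float((per_pixel * m).sum() / max(m.sum(), 1.0))
```

Because invalid pixels contribute nothing to the numerator or the normaliser, an arbitrarily large error inside a disoccluded region leaves the loss unchanged.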

If this is right

Novel view synthesis models can now be trained on tens of millions of single-view internet photos instead of scarce curated multi-view sets.
At test time the model requires only the input image and produces new views without any depth or 3D representation.
Zero-shot performance on in-the-wild images exceeds that of earlier methods trained with explicit multi-view pairs.
Training becomes practical at the scale of large uncurated web collections, while inference runs 600 times faster than the second-best baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lifting-and-masking idea could be applied to train other geometry tasks such as surface normal estimation or object insertion using only single images.
Removing the paired-data requirement opens the door to training on video frames treated as independent views.
The speed gain suggests deployment on mobile devices for real-time view synthesis in AR or video editing.

Load-bearing premise

The monocular depth estimates are accurate enough that the pseudo-target views they produce contain usable training signals despite any depth errors.

What would settle it

Run OVIE and a paired-supervision baseline on the same held-out multi-view dataset of real-world scenes and measure whether OVIE's novel-view error is at least as low as the baseline.
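One way to make this comparison concrete is PSNR against the held-out ground-truth view, a metric the paper reports on RealEstate10K. The helper below is a generic PSNR sketch, assuming images are float arrays in [0, 1]; the function name and the idea of calling it once per method are illustrative and not taken from the OVIE code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Under this sketch, the test passes if the mean of `psnr(ovie_view, gt_view)` over the held-out set is at least the mean of `psnr(baseline_view, gt_view)`, where the three arrays are hypothetical placeholders for each method's output and the real target view.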

Figures

Figures reproduced from arXiv: 2603.23488 by Adrien Ramanana Rahary, David Picard, Nicolas Dufour, Patrick Perez.

**Figure 2.** Method overview. Top: From web-sourced images I_0, a frozen monocular depth estimator extracts per-image 3D point clouds P. We then sample camera transformations T_{0→1} ∈ SE(3) (rotation and translation), apply them to the point clouds, and reproject to generate pseudo-target views I*_1. Bottom: Our model f_θ takes a source image I_0 and, conditioned on a camera transformation T_{0→1}, predicts the correspondi…
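Sampling a camera transformation in SE(3) can be sketched as a random small rotation (via Rodrigues' formula) plus a random translation. The function name `sample_se3` and the ranges `max_angle_deg` and `max_translation` are hypothetical; the caption does not state the sampling distribution the authors use.

```python
import numpy as np

def sample_se3(rng, max_angle_deg=10.0, max_translation=0.3):
    """Return a 4x4 rigid transform with a random rotation and translation."""
    # Random unit rotation axis and a bounded rotation angle.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))

    # Rodrigues' formula: R = I + sin(a) K + (1 - cos(a)) K^2.
    Kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

    # Translation drawn uniformly per axis (units follow the depth map).
    t = rng.uniform(-max_translation, max_translation, size=3)

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

A call such as `sample_se3(np.random.default_rng(0))` yields one transform. Its rotation block and translation column can then be applied to a lifted point cloud.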

**Figure 3.** Qualitative comparison with state-of-the-art methods. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]

**Figure 4.** Metric scale understanding. The same 20 cm camera translation is applied to two scenes of different physical scales. The close-up banana (left, 50 cm away) undergoes a large apparent displacement, while the room-scale scene (right, 3 m away) shows a proportionally smaller shift consistent with metrically correct parallax.
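The expected size of the two displacements can be checked with the standard pinhole parallax relation. This worked example assumes the 20 cm translation is parallel to the image plane and that both scenes are viewed with the same focal length f (in pixels); neither assumption is stated in the caption.

```latex
\Delta u = \frac{f\,b}{Z}, \qquad
\Delta u_{\text{banana}} = \frac{f \cdot 0.2}{0.5} = 0.4\,f, \qquad
\Delta u_{\text{room}} = \frac{f \cdot 0.2}{3} \approx 0.067\,f, \qquad
\frac{\Delta u_{\text{banana}}}{\Delta u_{\text{room}}} = \frac{3}{0.5} = 6
```

Under these assumptions the close-up object should shift about six times as far in the image as the room-scale scene, which matches the qualitative behaviour the caption describes.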

**Figure 5.** Scaling with dataset size. PSNR and FID on RealEstate10K as a function of training set size. Both metrics improve consistently as data volume increases. SSIM and LPIPS curves, which follow the same trend, are reported in the Supplementary. Effect of data diversity. To isolate diversity from scale, we train four models each on a single data source subsampled to 2M images (the size of our smallest source, Pl…

**Figure 6.** Quality vs. Inference tradeoff on DL3DV. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png]

**Figure 7.** Scaling with dataset size – SSIM and LPIPS. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png]

**Figure 8.** Quality vs. Inference tradeoff on DL3DV – SSIM and LPIPS. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png]

**Figure 9.** Qualitative results on out-of-distribution images. [PITH_FULL_IMAGE:figures/full_fig_p028_9.png]

**Figure 10.** Comparison of source inputs, training pseudo-targets, and gener… [PITH_FULL_IMAGE:figures/full_fig_p029_10.png]

**Figure 11.** Comparison of source inputs, training pseudo-targets, and gener… [PITH_FULL_IMAGE:figures/full_fig_p030_11.png]

**Figure 12.** Comparison of source inputs, training pseudo-targets, and gener… [PITH_FULL_IMAGE:figures/full_fig_p031_12.png]

**Figure 13.** Comparison of source inputs, training pseudo-targets, and gener… [PITH_FULL_IMAGE:figures/full_fig_p032_13.png]

**Figure 14.** Qualitative comparison with state-of-the-art methods. [PITH_FULL_IMAGE:figures/full_fig_p033_14.png]

**Figure 15.** Qualitative comparison with state-of-the-art methods. [PITH_FULL_IMAGE:figures/full_fig_p034_15.png]

**Figure 16.** Qualitative comparison with state-of-the-art methods. [PITH_FULL_IMAGE:figures/full_fig_p035_16.png]

**Figure 17.** Qualitative comparison with InfiniteNature-Zero [… [PITH_FULL_IMAGE:figures/full_fig_p036_17.png]

Original abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OVIE trains novel view synthesis on unpaired images by lifting them with monocular depth, warping to pseudo-targets, and applying masked losses, then drops all geometry at inference.

OVIE shows you can drop the multi-view requirement for novel view synthesis. The method lifts a single image with an off-the-shelf depth estimator, applies a random rigid transform, projects to a pseudo-target, and trains only on the valid pixels using masked geometric, perceptual, and textural losses. Training runs on 30 million uncurated internet images and produces a model that needs no depth or 3D structure at test time. Inference is 600 times faster than the second-best baseline, and the model beats prior methods in zero-shot evaluation.

Referee Report

2 major / 2 minor

Summary. The paper introduces OVIE, a monocular novel-view synthesis model trained exclusively on unpaired in-the-wild internet images. It uses a pre-trained monocular depth estimator to lift source images into 3D, applies random rigid transforms, and projects to create pseudo-target views, with a masked formulation of geometric, perceptual, and textural losses restricted to valid (non-disoccluded) regions. At inference the model is geometry-free. The central claim is that this yields superior zero-shot performance on novel-view synthesis benchmarks while being 600x faster than the second-best baseline.

Significance. If the empirical claims hold, the work would be significant for scaling novel-view synthesis to diverse, uncurated data at internet scale without requiring multi-view supervision. The geometry-free inference and reported speed advantage could broaden practical applicability. The public release of code and models is a positive contribution for reproducibility.

major comments (2)

[§3] §3 (Method), masked loss formulation: the claim that restricting losses to the valid mask M isolates the training signal from depth errors is load-bearing for the central argument, yet no ablation quantifies how depth inaccuracies (scale drift, boundary misalignment, or surface holes) propagate into the learned representation when errors are spatially correlated. A controlled experiment replacing the depth estimator with ground-truth or noisy variants would directly test this.
[§4] §4 (Experiments), zero-shot evaluation: the reported outperformance and 600x speedup are central claims, but the manuscript provides insufficient detail on the exact baselines, training data scale for each comparator, and whether the monocular depth estimator is frozen or fine-tuned during OVIE training. Without these, it is difficult to attribute gains to the masked monocular training versus the strength of the depth prior.

minor comments (2)

[Figure 2] Figure 2 and associated text: the visualization of the lifting-projection pipeline would benefit from explicit annotation of the mask M and the regions excluded from the loss.
[Related Work] Related work section: the discussion of prior monocular depth-based methods could more explicitly contrast the proposed masked formulation against existing warping-based approaches.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and positive assessment of the significance of our work. We address each major comment below and have revised the manuscript to improve clarity and provide additional analysis where feasible.

Referee: [§3] §3 (Method), masked loss formulation: the claim that restricting losses to the valid mask M isolates the training signal from depth errors is load-bearing for the central argument, yet no ablation quantifies how depth inaccuracies (scale drift, boundary misalignment, or surface holes) propagate into the learned representation when errors are spatially correlated. A controlled experiment replacing the depth estimator with ground-truth or noisy variants would directly test this.

Authors: We agree that quantifying the effect of depth inaccuracies would strengthen the paper. Because our training uses unpaired in-the-wild images, ground-truth depth is unavailable and a full GT-depth ablation is not possible. We have added a new ablation in the revised manuscript that replaces the depth estimator with a noisy variant (introducing scale drift, boundary misalignment, and holes) and measures the resulting degradation in novel-view quality. The results show that the masked loss formulation limits error propagation by excluding disoccluded and invalid pixels, supporting the original claim. We have also expanded the discussion in §3 to describe this mechanism explicitly.
revision: partial

Referee: [§4] §4 (Experiments), zero-shot evaluation: the reported outperformance and 600x speedup are central claims, but the manuscript provides insufficient detail on the exact baselines, training data scale for each comparator, and whether the monocular depth estimator is frozen or fine-tuned during OVIE training. Without these, it is difficult to attribute gains to the masked monocular training versus the strength of the depth prior.

Authors: We thank the referee for this request for clarification. In the revised manuscript we have expanded §4 with: (i) a table listing every baseline together with its original training data scale and supervision type, (ii) explicit confirmation that the monocular depth estimator is pre-trained and kept frozen throughout OVIE training, and (iii) an additional controlled experiment that isolates the contribution of the masked loss from the depth prior. These additions make clear that the reported gains arise from the monocular training procedure on 30 M unpaired images rather than from the depth estimator alone.
revision: yes
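The noisy-depth variant discussed in the first response could be built along the following lines. This is a hypothetical sketch of the three error types the referee names, not the ablation the authors describe: the function name `perturb_depth`, every parameter default, and the use of a horizontal roll as a crude proxy for boundary misalignment are assumptions.

```python
import numpy as np

def perturb_depth(depth, rng, scale_sigma=0.1, hole_frac=0.05, shift_px=1):
    """Return a corrupted copy of `depth` and a mask of pixels left valid."""
    # Scale drift: one global log-normal factor applied to the whole map.
    scale = np.exp(rng.normal(0.0, scale_sigma))
    noisy = depth * scale

    # Boundary misalignment (crude proxy): shift the map horizontally so
    # depth edges no longer line up with image edges. np.roll wraps around.
    noisy = np.roll(noisy, shift_px, axis=1)

    # Holes: drop a random fraction of pixels and mark them invalid.
    holes = rng.random(depth.shape) < hole_frac
    valid = ~holes
    noisy = np.where(valid, noisy, 0.0)
    return noisy, valid
```

Feeding the corrupted depth into the pseudo-target generation step, and comparing novel-view quality against a run with clean depth, would give one measure of how far such errors propagate.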

standing simulated objections not resolved

A controlled experiment that replaces the depth estimator with ground-truth depth on the in-the-wild unpaired training set, as no such ground-truth depth exists for internet-scale images.

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper's central training procedure lifts images using an external pre-trained monocular depth estimator (not fitted or defined within the work) to synthesize pseudo-target views via rigid transforms and projection, then applies masked geometric, perceptual, and textural losses only on valid regions. This objective is not equivalent to the inference-time novel-view output by construction, as the network must learn a generalizable mapping that operates without depth or 3D at test time. No load-bearing self-citations, self-definitional steps, or fitted parameters renamed as predictions appear in the provided text; the method is self-contained against external benchmarks and does not reduce the claimed result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of an external monocular depth estimator for creating training targets; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Monocular depth estimators provide sufficiently accurate 3D lifts to serve as geometric scaffolds for pseudo-view generation.
Invoked to lift source images into 3D and project after camera transformation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view... masked training formulation that restricts geometric, perceptual, and textural losses to valid regions

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: monocular depth estimator as a geometric scaffold

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages

[1]

CVPR (2024) 3, 8, 14

Astruc, G., Dufour, N., Siglidis, I., Aronssohn, C., Bouia, N., Fu, S., Loiseau, R., Nguyen, V.N., Raude, C., Vincent, E., Xu, L., Zhou, H., Landrieu, L.: OpenStreetView-5M: The many roads to global visual geolocation. CVPR (2024) 3, 8, 14

work page 2024
[2]

arXiv (2025) 4, 10

Bai, Y., Li, H., Huang, Q.: Positional encoding field. arXiv (2025) 4, 10

work page 2025
[3]

arXiv (2023) 5

Bhat,S.F.,Birkl,R.,Wofk,D.,Wonka,P.,Müller,M.:Zoedepth:Zero-shottransfer by combining relative and metric depth. arXiv (2023) 5

work page 2023
[4]

arXiv (2023) 5

Birkl, R., Wofk, D., Müller, M.: Midas v3.1 – a model zoo for robust monocular relative depth estimation. arXiv (2023) 5

work page 2023
[5]

In: ICCV (2021) 7

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021) 7

work page 2021
[6]

In: CVPR (2021) 4

Chan, E., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In: CVPR (2021) 4

work page 2021
[7]

In: CVPR (2022) 4

Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., Karras, T., Wetzstein, G.: Efficient geometry-aware 3d generative adversarial networks. In: CVPR (2022) 4

work page 2022
[8]

In: CVPR (2024) 3

Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In: CVPR (2024) 3

work page 2024
[9]

IEEE Transactions on Image Pro- cessing (1997) 7, 12

Charbonnier, P., Blanc-Feraud, L., Aubert, G., Barlaud, M.: Deterministic edge- preserving regularization in computed imaging. IEEE Transactions on Image Pro- cessing (1997) 7, 12

work page 1997
[10]

arXiv (2024) 3 16 A

Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv (2024) 3 16 A. Ramanana Rahary et al

work page 2024
[11]

In: NeurIPS (2024) 3

Chen, Y., Zheng, C., Xu, H., Zhuang, B., Vedaldi, A., Cham, T.J., Cai, J.: Mvs- plat360: Feed-forward 360 scene synthesis from sparse views. In: NeurIPS (2024) 3

work page 2024
[12]

arXiv (2023) 5

Chung, J., Lee, S., Nam, H., Lee, J., Lee, K.M.: Luciddreamer: Domain-free gen- eration of 3d gaussian splatting scenes. arXiv (2023) 5

work page 2023
[13]

In: CVPR (2017) 4

Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner,M.:Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: CVPR (2017) 4

work page 2017
[14]

NeurIPS (2023) 2

Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., VanderBilt, E., Kembhavi, A., Vondrick, C., Gkioxari, G., Ehsani, K., Schmidt, L., Farhadi, A.: Objaverse-xl: A universe of 10m+ 3d objects. NeurIPS (2023) 2

work page 2023
[15]

CVPR (2023) 2, 4

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. CVPR (2023) 2, 4

work page 2023
[16]

ICLR (2021) 10

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021) 10

work page 2021
[17]

CVPR (2025) 4, 5, 9, 12, 26

Elata, N., Kawar, B., Ostrovsky-Berman, Y., Farber, M., Sokolovsky, R.: Novel view synthesis with pixel-space diffusion models. CVPR (2025) 4, 5, 9, 12, 26

work page 2025
[18]

In: Proc

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proc. ICML (2024) 4

work page 2024
[19]

In: CVPR (2021) 8, 25

Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR (2021) 8, 25

work page 2021
[20]

In: NeurIPS (2022) 4

Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: Get3d: a generative model of high quality 3d textured shapes learned from images. In: NeurIPS (2022) 4

work page 2022
[21]

CVPR (2017) 7, 25

Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi- tional adversarial networks. CVPR (2017) 7, 25

work page 2017
[22]

In: ICCV (2021) 5

Jampani, V., Chang, H., Sargent, K., Kar, A., Tucker, R., Krainin, M., Kaeser, D., Freeman, W.T., Salesin, D., Curless, B., et al.: Slide: Single image 3d photography with soft layering and depth-aware inpainting. In: ICCV (2021) 5

work page 2021
[23]

In: CVPR (2024) 4

Jang, W., Agapito, L.: Nvist: In the wild new view synthesis from a single image with transformers. In: CVPR (2024) 4

work page 2024
[24]

In: ICLR (2025) 3

Jin, H., Jiang, H., Tan, H., Zhang, K., Bi, S., Zhang, T., Luan, F., Snavely, N., Xu, Z.: Lvsm: A large view synthesis model with minimal 3d inductive bias. In: ICLR (2025) 3

work page 2025
[25]

In: CVPR (2024) 5

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Re- purposing diffusion-based image generators for monocular depth estimation. In: CVPR (2024) 5

work page 2024
[26]

IEEE TPAMI (2025) 5

Ke, B., Qu, K., Wang, T., Metzger, N., Huang, S., Li, B., Obukhov, A., Schindler, K.: Marigold: Affordable adaptation of diffusion-based image generators for image analysis. IEEE TPAMI (2025) 5

work page 2025
[27]

SIGGRAPH (2023) 3

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. SIGGRAPH (2023) 3

work page 2023
[28]

IJCV (2020) 3, 8, 14 OVIE : Monocular Training for In-the-Wild Novel View Generation 17

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Ka- mali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., Ferrari, V.: The open images dataset v4: Unified image classification, object detection, and visual rela- tionship detection at scale. IJCV (2020) 3, 8, 14 OVIE : Monocular Training for In-the-Wild Novel View Ge...

work page 2020
[29]

arXiv (2025) 4

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv (2025) 4

work page 2025
[30]

In: ECCV (2022) 4, 20, 27, 36

Li, Z., Wang, Q., Snavely, N., Kanazawa, A.: Infinitenature-zero: Learning perpet- ual view generation of natural scenes from single images. In: ECCV (2022) 4, 20, 27, 36

work page 2022
[31]

In: CVPR (2024) 2, 3, 4, 9, 12, 13, 14, 20

Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: CVPR (2024) 2, 3, 4, 9, 12, 13, 14, 20

work page 2024
[32]

In: ICCV (2021) 5

Liu, A., Makadia, A., Tucker, R., Snavely, N., Jampani, V., Kanazawa, A.: Infinite nature: Perpetual view generation of natural scenes from a single image. In: ICCV (2021) 5

work page 2021
[33]

In: ICCV (2023) 4, 5

Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot one image to 3d object. In: ICCV (2023) 4, 5

work page 2023
[34]

arXiv (2026) 7

Ma, Z., Xu, R., Zhang, S.: Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. arXiv (2026) 7

work page 2026
[35]

In: ICCV (2025) 4

Maillard, L., Durand, T., Rahary, A.R., Ovsjanikov, M.: Laconic: A 3d layout adapter for controllable image creation. In: ICCV (2025) 4

work page 2025
[36]

In: ICLR (2026) 3

Mescheder, L., Dong, W., Li, S., Bai, X., Santos, M., Hu, P., Lecouat, B., Zhen, M., Delaunoy, A., Fang, T., Tsin, Y., Richter, S.R., Koltun, V.: Sharp monocular view synthesis in less than a second. In: ICLR (2026) 3

work page 2026
[37]

In: ECCV (2020) 3

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020) 3

work page 2020
[38]

In: CVPR (2024) 5

Müller, N., Schwarz, K., Rössle, B., Porzi, L., Bulò, S.R., Nießner, M., Kontschieder, P.: Multidiff: Consistent novel view synthesis from a single image. In: CVPR (2024) 5

work page 2024
[39]

In: ICCV (2019) 4

Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: Hologan: Unsuper- vised learning of 3d representations from natural images. In: ICCV (2019) 4

work page 2019
[40]

In: CVPR (2021) 4

Niemeyer, M., Geiger, A.: Giraffe: Representing scenes as compositional generative neural feature fields. In: CVPR (2021) 4

work page 2021
[41]

ACM Transactions on Graphics (2019) 5

Niklaus, S., Mai, L., Yang, J., Liu, F.: 3d ken burns effect from a single image. ACM Transactions on Graphics (2019) 5

work page 2019
[42]

arXiv (2023) 7

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

work page 2023
[43]

In: ICCV (2023) 8, 24

Peebles, W., Xie, S.: Scalable Diffusion Models with Transformers . In: ICCV (2023) 8, 24

work page 2023
[44]

arXiv (2025) 5

Piccinelli, L., Sakaridis, C., Yang, Y.H., Segu, M., Li, S., Abbeloos, W., Gool, L.V.: UniDepthV2: Universal monocular metric depth estimation made simpler. arXiv (2025) 5

work page 2025
[45]

In: CVPR (2024) 5

Piccinelli, L., Yang, Y.H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., Yu, F.: UniDepth: Universal monocular metric depth estimation. In: CVPR (2024) 5

work page 2024
[46]

arXiv (2021) 4

Ramirez, P.Z., Tonioni, A., Tombari, F.: Unsupervised novel view synthesis from a single image. arXiv (2021) 4

work page 2021
[47]

In: ICCV (2021) 5 18 A

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021) 5 18 A. Ramanana Rahary et al

work page 2021
[48]

IEEE TPAMI (2022) 5

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI (2022) 5

work page 2022
[49]

CVPR (2024) 4

Reddy, P., Elezi, I., Deng, J.: G3dr: Generative 3d reconstruction in imagenet. CVPR (2024) 4

work page 2024
[50]

In: CVPR (2025) 10

Ren,X.,Shen,T.,Huang,J.,Ling,H.,Lu,Y.,Nimier-David,M.,Müller,T.,Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: CVPR (2025) 10

work page 2025
[51]

arXiv (2021) 3, 8, 14

Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: Imagenet-21k pretraining for the masses. arXiv (2021) 3, 8, 14

work page 2021
[52]

In: ICCV

Rombach, R., Esser, P., Ommer, B.: Geometry-free view synthesis: Transformers and no 3d priors. In: ICCV. pp. 14356–14366 (2021) 4, 5, 9, 12, 26

work page 2021
[53]

In: CVPR (2022) 4, 5

Sajjadi,M.S.M.,Meyer,H.,Pot,E.,Bergmann, U.,Greff,K., Radwan,N., Vora,S., Lucic,M.,Duckworth,D.,Dosovitskiy,A.,Uszkoreit,J.,Funkhouser,T.,Tagliasac- chi, A.: Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. In: CVPR (2022) 4, 5

work page 2022
[54]

arXiv (2023) 5

Sargent, K., Li, Z., Shah, T., Herrmann, C., Yu, H.X., Zhang, Y., Chan, E.R., La- gun, D., Fei-Fei, L., Sun, D., Wu, J.: ZeroNVS: Zero-shot 360-degree view synthesis from a single real image. arXiv (2023) 5

work page 2023
[55]

In: Proc

Sauer, A., Karras, T., Laine, S., Geiger, A., Aila, T.: Stylegan-t: unlocking the power of gans for fast large-scale text-to-image synthesis. In: Proc. ICML (2023) 8

work page 2023
[56]

In: Advances in Neural Information Processing Systems (NeurIPS) (2020) 4

Schwarz,K.,Liao,Y.,Niemeyer,M.,Geiger,A.:Graf:Generativeradiancefieldsfor 3d-aware image synthesis. In: Advances in Neural Information Processing Systems (NeurIPS) (2020) 4

work page 2020
[57]

arXiv (2024) 5

Seo, J., Fukuda, K., Shibuya, T., Narihira, T., Murata, N., Hu, S., Lai, C.H., Kim, S., Mitsufuji, Y.: Genwarp: Single image to novel views with semantic-preserving generative warping. arXiv (2024) 5

work page 2024
[58]

In: CVPR (2020) 5

Shih, M.L., Su, S.Y., Kopf, J., Huang, J.B.: 3d photography using context-aware layered depth inpainting. In: CVPR (2020) 5

work page 2020
[59]

arXiv (2025) 7, 25

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: Dinov3. arXiv (2025) 7, 25

work page 2025
[60]

3DV (2025) 3

Szymanowicz, S., Insafutdinov, E., Zheng, C., Campbell, D., Henriques, J., Rup- precht,C.,Vedaldi,A.:Flash3d:Feed-forwardgeneralisable3dscenereconstruction from a single image. 3DV (2025) 3

work page 2025
[61]

In: CVPR (2024) 3

Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view 3d reconstruction. In: CVPR (2024) 3

work page 2024
[62]

In: CVPR (2020) 5

Tucker, R., Snavely, N.: Single-view view synthesis with multiplane images. In: CVPR (2020) 5

work page 2020
[63]

In: CVPR (2025) 8

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: CVPR (2025) 8

work page 2025
[64]

In: CVPR (2021) 3

Wang, Q., Wang, Z., Genova, K., Srinivasan, P., Zhou, H., Barron, J.T., Martin- Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based rendering. In: CVPR (2021) 3

work page 2021
[65]

In: CVPR (2025) 5 OVIE : Monocular Training for In-the-Wild Novel View Generation 19

Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y., Tong, X., Yang, J.: Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In: CVPR (2025) 5 OVIE : Monocular Training for In-the-Wild Novel View Generation 19

work page 2025
[66]

In: NeurIPS (2025) 5, 7, 9, 24

Wang, R., Xu, S., Dong, Y., Deng, Y., Xiang, J., Lv, Z., Sun, G., Tong, X., Yang, J.: Moge-2: Accurate monocular geometry with metric scale and sharp details. In: NeurIPS (2025) 5, 7, 9, 24

work page 2025
[67]

In: CVPR (2024) 4

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: CVPR (2024) 4

[68]

Wiles, O., Gkioxari, G., Szeliski, R., Johnson, J.: SynSin: End-to-end view synthesis from a single image. In: CVPR (2020) 4, 5

[69]

Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Liu, G., Wang, X., Shan, Y., Wong, T.T.: Dynamicrafter: Animating open-domain images with video diffusion priors. In: ECCV (2025) 4

[70]

Xu, H., Peng, S., Wang, F., Blum, H., Barath, D., Geiger, A., Pollefeys, M.: Depthsplat: Connecting gaussian splatting and depth. In: CVPR (2025) 3

[71]

Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: CVPR (2024) 5

[72]

Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: Neural radiance fields from one or few images. In: CVPR (2021) 3

[73]

Yu, J.J., Forghani, F., Derpanis, K.G., Brubaker, M.A.: Long-term photometric consistent novel view synthesis with diffusion models. In: ICCV (2023) 4, 5, 9, 12, 26

[74]

Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. IEEE TPAMI (2024) 4

[75]

Yu, X., Xu, M., Zhang, Y., Liu, H., Ye, C., Wu, Y., Yan, Z., Liang, T., Chen, G., Cui, S., Han, X.: Mvimgnet: A large-scale dataset of multi-view images. In: CVPR (2023) 4

[76]

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018) 7

[77]

Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. In: ICLR (2026) 25

[78]

Zhou, B., Khosla, A., Lapedriza, À., Torralba, A., Oliva, A.: Places: An image database for deep scene understanding. arXiv (2016) 3, 8, 14

[79]

Zhou, J.J., Gao, H., Voleti, V., Vasishta, A., Yao, C.H., Boss, M., Torr, P., Rupprecht, C., Jampani, V.: Stable virtual camera: Generative view synthesis with diffusion models. arXiv (2025) 3, 4

[80]

Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. SIGGRAPH (2018) 2, 3, 4, 9, 12, 13, 14, 20, 21, 26

