One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
Pith reviewed 2026-05-15 00:31 UTC · model grok-4.3
The pith
A single unpaired image plus monocular depth is enough to train novel view synthesis at scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OVIE is trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation.
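The lift, transform, and project steps described above can be sketched with a pinhole camera model. This is a minimal NumPy illustration, not the authors' implementation: the function name `lift_transform_project`, the shared intrinsics `K`, and the convention that `(R, t)` maps source-camera to target-camera coordinates are all assumptions made for the example.

```python
import numpy as np

def lift_transform_project(depth, K, R, t):
    """Map each source pixel to its location in a virtual target camera.

    depth: (H, W) depth along the camera z-axis (assumed positive).
    K:     (3, 3) pinhole intrinsics, assumed shared by both cameras.
    R, t:  rotation (3, 3) and translation (3,) taking source-camera
           coordinates to target-camera coordinates (assumed convention).
    Returns target pixel coordinates (H, W, 2) and target depth (H, W).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T      # back-project: K^-1 [u, v, 1]^T
    pts_src = rays * depth[..., None]    # lift: 3D points in the source frame
    pts_tgt = pts_src @ R.T + t          # apply the sampled rigid transform
    proj = pts_tgt @ K.T                 # project with the same intrinsics
    z = proj[..., 2]
    uv = proj[..., :2] / z[..., None]    # assumes z > 0 for every point
    return uv, z
```

Resampling source colours at the returned coordinates (for example by forward splatting with a z-buffer) would give the pseudo-target image; target pixels that receive no source point are the disoccluded regions the masked losses exclude.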
What carries the argument
A monocular depth estimator serves as the geometric scaffold: it lifts each single image into 3D, a sampled camera transformation is applied, and the result is reprojected. The masked reprojection losses are then computed only on valid pixels.
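Restricting a loss to valid pixels can be illustrated with a masked mean. The sketch below uses a plain L1 term as a stand-in; the paper's actual geometric, perceptual, and textural losses are not specified here, and the name `masked_l1` and the `(H, W, C)` array layout are assumptions for illustration only.

```python
import numpy as np

def masked_l1(pred, target, mask, eps=1e-8):
    """Mean absolute error over pixels where mask is True.

    pred, target: (H, W, C) float arrays.
    mask:         (H, W) boolean array, True for valid (non-disoccluded) pixels.
    """
    m = mask.astype(pred.dtype)[..., None]        # (H, W, 1), broadcast over channels
    diff = np.abs(pred - target) * m              # zero out invalid pixels
    return diff.sum() / (m.sum() * pred.shape[-1] + eps)
```

Normalising by the number of valid entries rather than by the full image size keeps the loss scale comparable across samples with different amounts of disocclusion.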
If this is right
- Novel view synthesis models can now be trained on tens of millions of single-view internet photos instead of scarce curated multi-view sets.
- At test time the model requires only the input image and produces new views without any depth or 3D representation.
- Zero-shot performance on in-the-wild images exceeds that of earlier methods trained with explicit multi-view pairs.
- Training becomes practical at the scale of large uncurated web collections (30 million images here), while inference is reported to run 600x faster than the second-best baseline.
Where Pith is reading between the lines
- The same lifting-and-masking idea could be applied to train other geometry tasks such as surface normal estimation or object insertion using only single images.
- Removing the paired-data requirement opens the door to training on video frames treated as independent views.
- The speed gain suggests deployment on mobile devices for real-time view synthesis in AR or video editing.
Load-bearing premise
The monocular depth estimates are accurate enough that the pseudo-target views they produce contain usable training signals despite any depth errors.
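To see how a depth error turns into a misplaced pseudo-target pixel, consider a worked case under simplifying assumptions that are not taken from the paper: a pinhole camera with focal length $f$ and principal point $c_x$, a pure sideways translation $t_x$, and an estimated depth $\hat{Z} = (1+\varepsilon) Z$ with relative error $\varepsilon$.

```latex
% True target column for a point (X, Y, Z) seen at source column u = f X / Z + c_x:
u' = \frac{f\,(X + t_x)}{Z} + c_x = u + \frac{f\,t_x}{Z}

% Lifting with \hat{Z} = (1+\varepsilon) Z scales the whole point by (1+\varepsilon):
\hat{u}' = \frac{f\,\bigl((1+\varepsilon) X + t_x\bigr)}{(1+\varepsilon) Z} + c_x
         = u + \frac{f\,t_x}{(1+\varepsilon)\,Z}

% Resulting reprojection error:
\hat{u}' - u' = -\,\frac{f\,t_x}{Z}\cdot\frac{\varepsilon}{1+\varepsilon}
```

With illustrative values $f = 500$ px, $t_x = 0.1$ m, and $Z = 2$ m, the true shift is 25 px, and a 10% depth overestimate ($\varepsilon = 0.1$) misplaces the pixel by about 2.3 px. The error grows with the camera baseline and shrinks with scene depth, which is one way to read why the premise matters most for nearby content and large sampled motions.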
What would settle it
Run OVIE and a paired-supervision baseline on the same held-out multi-view dataset of real-world scenes and measure whether OVIE's novel-view error is at least as low as the baseline.
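The comparison needs a concrete novel-view error. The text does not name a metric, so the sketch below assumes PSNR, one common choice, computed against held-out ground-truth views; the function name and the `[0, max_val]` image range are assumptions.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Under this reading, the test is whether OVIE's average score over the held-out views is at least as high as the paired-supervision baseline's.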
Original abstract
Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OVIE, a monocular novel-view synthesis model trained exclusively on unpaired in-the-wild internet images. It uses a pre-trained monocular depth estimator to lift source images into 3D, applies random rigid transforms, and projects to create pseudo-target views, with a masked formulation of photometric, perceptual, and textural losses restricted to valid (non-disoccluded) regions. At inference the model is geometry-free. The central claim is that this yields superior zero-shot performance on novel-view synthesis benchmarks while being 600x faster than the next-best baseline.
Significance. If the empirical claims hold, the work would be significant for scaling novel-view synthesis to diverse, uncurated data at internet scale without requiring multi-view supervision. The geometry-free inference and reported speed advantage could broaden practical applicability. The public release of code and models is a positive contribution for reproducibility.
major comments (2)
- [§3] §3 (Method), masked loss formulation: the claim that restricting losses to the valid mask M isolates the training signal from depth errors is load-bearing for the central argument, yet no ablation quantifies how depth inaccuracies (scale drift, boundary misalignment, or surface holes) propagate into the learned representation when errors are spatially correlated. A controlled experiment replacing the depth estimator with ground-truth or noisy variants would directly test this.
- [§4] §4 (Experiments), zero-shot evaluation: the reported outperformance and 600x speedup are central claims, but the manuscript provides insufficient detail on the exact baselines, training data scale for each comparator, and whether the monocular depth estimator is frozen or fine-tuned during OVIE training. Without these, it is difficult to attribute gains to the masked monocular training versus the strength of the depth prior.
minor comments (2)
- [Figure 2] Figure 2 and associated text: the visualization of the lifting-projection pipeline would benefit from explicit annotation of the mask M and the regions excluded from the loss.
- [Related Work] Related work section: the discussion of prior monocular depth-based methods could more explicitly contrast the proposed masked formulation against existing warping-based approaches.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of the significance of our work. We address each major comment below and have revised the manuscript to improve clarity and provide additional analysis where feasible.
-
Referee: [§3] §3 (Method), masked loss formulation: the claim that restricting losses to the valid mask M isolates the training signal from depth errors is load-bearing for the central argument, yet no ablation quantifies how depth inaccuracies (scale drift, boundary misalignment, or surface holes) propagate into the learned representation when errors are spatially correlated. A controlled experiment replacing the depth estimator with ground-truth or noisy variants would directly test this.
Authors: We agree that quantifying the effect of depth inaccuracies would strengthen the paper. Because our training uses unpaired in-the-wild images, ground-truth depth is unavailable and a full GT-depth ablation is not possible. We have added a new ablation in the revised manuscript that replaces the depth estimator with a noisy variant (introducing scale drift, boundary misalignment, and holes) and measures the resulting degradation in novel-view quality. The results show that the masked loss formulation limits error propagation by excluding disoccluded and invalid pixels, supporting the original claim. We have also expanded the discussion in §3 to describe this mechanism explicitly.
revision: partial
-
Referee: [§4] §4 (Experiments), zero-shot evaluation: the reported outperformance and 600x speedup are central claims, but the manuscript provides insufficient detail on the exact baselines, training data scale for each comparator, and whether the monocular depth estimator is frozen or fine-tuned during OVIE training. Without these, it is difficult to attribute gains to the masked monocular training versus the strength of the depth prior.
Authors: We thank the referee for this request for clarification. In the revised manuscript we have expanded §4 with: (i) a table listing every baseline together with its original training data scale and supervision type, (ii) explicit confirmation that the monocular depth estimator is pre-trained and kept frozen throughout OVIE training, and (iii) an additional controlled experiment that isolates the contribution of the masked loss from the depth prior. These additions make clear that the reported gains arise from the monocular training procedure on 30 M unpaired images rather than from the depth estimator alone.
revision: yes
- A controlled experiment that replaces the depth estimator with ground-truth depth on the in-the-wild unpaired training set, as no such ground-truth depth exists for internet-scale images.
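The noisy-depth variant mentioned in the rebuttal (scale drift, boundary misalignment, holes) could be simulated along the following lines. This is a hypothetical sketch: the function `perturb_depth`, its parameters, and the specific corruption models (a log-normal global scale, a horizontal pixel shift, uniformly random holes) are assumptions and are not taken from the paper or its code.

```python
import numpy as np

def perturb_depth(depth, rng, scale_sigma=0.1, shift_px=1, hole_frac=0.05):
    """Return a corrupted copy of `depth` and a boolean mask of punched holes.

    All three corruption models are illustrative assumptions.
    """
    noisy = depth * np.exp(rng.normal(0.0, scale_sigma))  # global scale drift
    noisy = np.roll(noisy, shift_px, axis=1)              # crude boundary misalignment
    holes = rng.random(depth.shape) < hole_frac           # random invalid pixels
    noisy = np.where(holes, np.nan, noisy)                # mark holes as NaN
    return noisy, holes
```

Sweeping `scale_sigma`, `shift_px`, and `hole_frac` while holding the rest of training fixed would give a degradation curve of the kind the referee asks for; hole pixels would also need to be excluded from the valid mask.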
Circularity Check
No circularity detected in derivation chain
The paper's central training procedure lifts images using an external pre-trained monocular depth estimator (not fitted or defined within the work) to synthesize pseudo-target views via rigid transforms and projection, then applies standard masked photometric/perceptual/textural losses only on valid regions. This objective is not equivalent to the inference-time novel-view output by construction, as the network must learn a generalizable mapping that operates without depth or 3D at test time. No load-bearing self-citations, self-definitional steps, or fitted parameters renamed as predictions appear in the provided text; the method is self-contained against external benchmarks and does not reduce the claimed result to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Monocular depth estimators provide sufficiently accurate 3D lifts to serve as geometric scaffolds for pseudo-view generation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view... masked training formulation that restricts geometric, perceptual, and textural losses to valid regions"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "monocular depth estimator as a geometric scaffold"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.