arXiv: 2604.11211 · v1 · submitted 2026-04-13 · cs.CV · cs.LG · cs.MM

3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification: cs.CV, cs.LG, cs.MM
keywords: real-time view synthesis, feedforward network, novel view interpolation, sparse multi-view video, depth estimation, occlusion-aware blending, AR/VR rendering

The pith

A feedforward network called 3DTV performs real-time novel view synthesis from sparse multi-view video without per-scene optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces 3DTV, a network that, for any target viewpoint, selects three input cameras via Delaunay triangulation so that the target stays well covered in angle. It then runs a pose-aware depth module that builds a coarse-to-fine pyramid of depth maps, which allows efficient reprojection of image features and blending that respects occlusions. Because the entire pipeline runs in a single forward pass, a trained model produces interpolated views at interactive rates and applies to new scenes without further adjustment. A sympathetic reader would care because earlier high-quality view synthesis often required slow per-scene optimization, which limited its use in live AR, VR, or telepresence settings. According to the paper, experiments on multi-view video datasets show that this design maintains competitive image quality while running at interactive speeds and without relying on explicit 3D scene models.
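
The triplet selection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes each camera is reduced to a 2D angular coordinate (for example azimuth and elevation), uses a brute-force empty-circumcircle test instead of a production Delaunay routine, and the names `delaunay_triangles` and `select_triplet` are hypothetical.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Return (ux, uy, r_squared) of the circle through a, b, c, or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay_triangles(points):
    """Brute force: keep every triple whose circumcircle contains no other point."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            triangles.append((i, j, k))
    return triangles

def barycentric(p, a, b, c):
    """Barycentric coordinates of p with respect to the triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b

def select_triplet(cameras, target, eps=1e-9):
    """Return (camera indices, barycentric weights) of the triangle enclosing target."""
    for tri in delaunay_triangles(cameras):
        weights = barycentric(target, *(cameras[i] for i in tri))
        if min(weights) >= -eps:
            return tri, weights
    return None  # target lies outside the convex hull of the cameras

# Hypothetical rig: four corner cameras plus one in the middle.
cameras = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(select_triplet(cameras, (0.5, 0.2)))  # picks cameras 0, 1 and 4
```

A target outside the convex hull of the cameras has no enclosing triangle, which is one concrete way the angular-coverage assumption can fail. The brute-force test is quartic in the number of cameras; it is used here only to keep the sketch dependency-free.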

Core claim

3DTV is a feedforward interpolation network for real-time sparse-view synthesis. A Delaunay-based triplet selection ensures angular coverage for each target view. A pose-aware depth module estimates a coarse-to-fine depth pyramid that supports efficient feature reprojection and occlusion-aware blending. The network runs entirely feedforward without retraining or explicit proxies and achieves a practical balance of quality and speed on challenging multi-view video datasets.
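
To make "feature reprojection" concrete, the following sketch maps one target pixel with a known depth into a source camera under a standard pinhole model. It is an illustration under stated assumptions, not the paper's module: intrinsics are passed as a hypothetical `(fx, fy, cx, cy)` tuple, and `R_ts`, `t_ts` are assumed to transform target-camera coordinates into source-camera coordinates.

```python
def unproject(u, v, depth, intr):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    fx, fy, cx, cy = intr
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def transform(point, R, t):
    """Apply the rigid transform x -> R x + t (R is a 3x3 nested list)."""
    return tuple(sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3))

def project(point, intr):
    """Project a 3D point to (u, v, z); return None if it lies behind the camera."""
    fx, fy, cx, cy = intr
    x, y, z = point
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy, z)

def reproject(u, v, depth, intr_tgt, R_ts, t_ts, intr_src):
    """Where does target pixel (u, v) at this depth land in the source image?"""
    point_tgt = unproject(u, v, depth, intr_tgt)
    point_src = transform(point_tgt, R_ts, t_ts)
    return project(point_src, intr_src)

# Hypothetical setup: identical intrinsics, pure sideways offset of 0.1 units.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intr = (500.0, 500.0, 320.0, 240.0)
print(reproject(320.0, 240.0, 2.0, intr, identity, (0.1, 0.0, 0.0), intr))
```

In this setup the horizontal shift equals focal length times offset divided by depth, so a wrong depth estimate translates directly into sampling the source features at the wrong location. The returned `z` is the depth of the point as seen from the source camera, which is the quantity an occlusion test would compare against the source view's own depth.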

What carries the argument

The combination of Delaunay triangulation for selecting input camera triplets and a pose-aware depth module that produces a multi-scale depth pyramid for feature warping and blending.
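
A hand-written stand-in can illustrate what "occlusion-aware blending" has to accomplish for a single pixel. The rule below is an assumption made for illustration; the summary does not spell out the paper's blending, which may well be learned. Each of the three views contributes a color, a base weight (for example a barycentric weight), the depth of the target point as seen from that view (`z_proj`), and the depth that view actually observes at the reprojected pixel (`z_src`). The function name `blend_pixel` and the softness parameter `tau` are hypothetical.

```python
import math

def blend_pixel(colors, base_weights, z_proj, z_src, tau=0.05):
    """Blend per-view colors, down-weighting views where the point is occluded.

    A view counts as occluded when it observes a surface nearer than the
    target point (z_src < z_proj); the penalty decays smoothly with tau.
    Returns (blended_rgb, normalized_weights), or (None, weights) if no view
    carries any weight.
    """
    weights = []
    for w, zp, zs in zip(base_weights, z_proj, z_src):
        occlusion = max(0.0, zp - zs)
        weights.append(w * math.exp(-occlusion / tau))
    total = sum(weights)
    if total <= 1e-12:
        return None, weights
    weights = [w / total for w in weights]
    blended = tuple(sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3))
    return blended, weights

# Hypothetical pixel: the third view sees a surface one unit in front of the point.
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
rgb, weights = blend_pixel(colors, [0.3, 0.3, 0.4], [2.0, 2.0, 2.0], [2.0, 2.0, 1.0])
print(rgb, weights)
```

In the example the occluded view is suppressed almost entirely and the two remaining views share the weight equally. The same structure also shows why depth accuracy is load-bearing: an error in `z_src` comparable to `tau` is enough to suppress a view that is actually visible, or to admit one that is not.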

If this is right

  • Real-time free-viewpoint rendering becomes feasible for interactive AR, VR, and telepresence without offline optimization.
  • The system produces robust results across diverse scenes because it avoids reliance on explicit geometric proxies.
  • Low-latency multi-view streaming and interactive rendering become practical on standard hardware.
  • Quality and efficiency trade-offs improve over prior real-time novel-view baselines on multi-view video data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same triplet-plus-depth-pyramid structure might extend to dynamic scenes if the depth module is replaced by a temporal version.
  • Integration with existing video compression pipelines could further reduce bandwidth for streaming interpolated viewpoints.
  • The feedforward design suggests that lightweight geometric priors can replace heavy optimization in other view-synthesis tasks.

Load-bearing premise

The Delaunay triplet selection always supplies enough angular coverage and the estimated depth maps are accurate enough for occlusion handling without any scene-specific tuning.

What would settle it

Running the method on a scene whose camera layout yields poor triangulation coverage or where depth estimation fails on thin structures or reflections, then measuring whether output quality drops below that of the compared real-time baselines.

Figures

Figures reproduced from arXiv: 2604.11211 by Fernando Edelstein, Hannah Dröge, Markus Plack, Matthias B. Hullin, Stefan Schulz.

Figure 1: We present 3DTV, a real-time method for novel view synthesis from sparse cameras. It com…
Figure 2: Overview of our real-time view interpolation framework. A lightweight Ghost-based backbone…
Figure 3: Details on the Correlation Block (left) and Proj Block (right) from Fig. 2. The correlation…
Figure 4: Qualitative comparison on multiple datasets for human capture and non-human containing scenes…
Figure 5: Generalization behavior of 3DTV across geometry, viewpoint, and resolution. Despite synthetic…
Figure 6: Exemplary Delaunay triangulation for the RIFTCast [9] capture stage setup. The scene has been…
Figure 7: Evaluation of hyperparameter choices and their influence on the resulting triangulation.
Figure 8: (Cont.) Evaluation of hyperparameter choices and their influence on the resulting triangulation.
Figure 9: Snap-Snap produces significant ghosting and duplication artifacts when rendering the frontal view from the RIFTCast [9] dataset
Figure 11: Failure analysis of FWD: The model fails…
Figure 12: Qualitative comparison of our method to online and offline reconstruction methods. (Part 1.1)
Figure 13: Qualitative comparison of our method to online and offline reconstruction methods. (Part 1.2)
Figure 14: Qualitative comparison of our method to online and offline reconstruction methods. (Part 2.1)
Figure 15: Qualitative comparison of our method to online and offline reconstruction methods. (Part 2.2)
Figure 16: Qualitative evaluation (Part 1) of our model and the different influences of individual parts to…
Figure 17: Qualitative evaluation (Part 2) of our model and the different influences of individual parts to…
Original abstract

Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces 3DTV, a feedforward interpolation network for real-time view synthesis from sparse multi-view inputs. It employs Delaunay-based triplet selection to ensure angular coverage and a pose-aware depth module with a coarse-to-fine pyramid for efficient feature reprojection and occlusion-aware blending. The approach is designed to run without scene-specific optimization or retraining, and the authors report that it achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view synthesis baselines on multi-view video datasets while avoiding explicit proxies for robust rendering across diverse scenes.

Significance. If the experimental results are substantiated, this work could have significant impact on practical applications in AR/VR, telepresence, and interactive rendering by providing a lightweight, feedforward alternative to optimization-heavy methods. The combination of geometric priors (Delaunay selection) with learned components (depth estimation) without requiring per-scene tuning is a promising direction for generalization. However, the current presentation lacks sufficient quantitative evidence to fully evaluate the claims.

major comments (2)
  1. [Abstract and Experiments] The abstract claims that 3DTV 'consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines' and 'enabling robust rendering across diverse scenes,' but provides no quantitative metrics (e.g., PSNR, SSIM, runtime), specific baselines, dataset details, or error analysis. This makes it impossible to verify the outperformance and robustness claims without the full results section.
  2. [Pose-aware depth module] The feedforward claim and robust rendering across diverse scenes rest on the pose-aware depth module (coarse-to-fine pyramid) producing depth maps sufficiently precise for feature reprojection and occlusion-aware blending without scene-specific optimization or explicit proxies. The manuscript should include ablation studies or depth accuracy metrics to demonstrate that depth errors do not lead to visible artifacts in textureless or occluded regions.
minor comments (1)
  1. [Notation] Ensure consistent use of terms like 'pose-aware depth module' and 'coarse-to-fine depth pyramid' throughout the paper to avoid confusion.
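
For readers unfamiliar with the metrics the first major comment asks for, PSNR is the simplest of them. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), on flattened pixel lists; the function name and the default `max_value` of 1.0 (images scaled to the unit range) are assumptions made for illustration, and nothing here reflects numbers reported in the paper.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical example: a uniform error of 0.1 on a unit-range image.
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

Because PSNR averages squared error over the whole image, localized failures such as ghosting around thin structures can leave it nearly unchanged, which is one reason the referee also asks for error analysis rather than a single aggregate number.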

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better substantiate the claims in the abstract and provide additional evidence for the depth module. We address each point below and have revised the manuscript to incorporate the suggestions where feasible.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The abstract claims that 3DTV 'consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines' and 'enabling robust rendering across diverse scenes,' but provides no quantitative metrics (e.g., PSNR, SSIM, runtime), specific baselines, dataset details, or error analysis. This makes it impossible to verify the outperformance and robustness claims without the full results section.

    Authors: We agree that the abstract would benefit from additional context to support the summarized claims. In the revised manuscript, we have updated the abstract to include brief references to key quantitative results (e.g., average PSNR/SSIM gains and runtime on the evaluated multi-view video datasets) and the main baselines compared. The full experimental details, including dataset descriptions, error analysis, and comparisons to recent real-time NVS baselines, remain in Section 4. This change improves accessibility without lengthening the abstract excessively.

    Revision: yes

  2. Referee: [Pose-aware depth module] The feedforward claim and robust rendering across diverse scenes rest on the pose-aware depth module (coarse-to-fine pyramid) producing depth maps sufficiently precise for feature reprojection and occlusion-aware blending without scene-specific optimization or explicit proxies. The manuscript should include ablation studies or depth accuracy metrics to demonstrate that depth errors do not lead to visible artifacts in textureless or occluded regions.

    Authors: The pose-aware depth module is evaluated through its contribution to end-to-end rendering quality across diverse scenes, as shown in our experiments, none of which use per-scene optimization. We acknowledge the value of targeted ablations. In the revised version, we have added an ablation study in Section 4.3 comparing the coarse-to-fine pyramid against a single-scale depth estimator, with qualitative examples and rendering metrics demonstrating reduced artifacts in occluded and textureless regions. Standalone depth accuracy metrics (e.g., absolute depth error) are not reported because ground-truth depth is unavailable for the primary video datasets; our evaluation prioritizes perceptual rendering quality, which indirectly validates the depth module's precision for reprojection and blending.

    Revision: partial
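
The contrast between a coarse-to-fine scheme and a single-scale estimator can be shown with a toy per-pixel search. This is a sketch in the spirit of cascade cost volumes such as [38], not the paper's pose-aware module, which operates on feature maps at several resolutions. The function name, the `cost` callback (lower is better, for example a photometric matching error), and the narrowing rule are all assumptions made for illustration.

```python
def coarse_to_fine_depth(cost, d_min, d_max, levels=3, samples=9):
    """Search for the depth minimizing cost by repeatedly narrowing the range.

    Each level evaluates `samples` evenly spaced hypotheses, keeps the best
    one, and shrinks the range to one step on either side of it.
    """
    lo, hi = d_min, d_max
    best = 0.5 * (lo + hi)
    for _ in range(levels):
        step = (hi - lo) / (samples - 1)
        candidates = [lo + i * step for i in range(samples)]
        best = min(candidates, key=cost)
        lo = max(d_min, best - step)
        hi = min(d_max, best + step)
    return best

# Hypothetical cost with a single minimum at depth 2.37.
estimate = coarse_to_fine_depth(lambda d: abs(d - 2.37), 1.0, 5.0)
print(estimate)
```

With nine hypotheses per level and three levels, the search evaluates 27 candidates and ends at a step of 0.03125 over the range 1 to 5; a single-scale sweep at that resolution would need 129 candidates. The price is that a wrong pick at the coarse level cannot be recovered later, which is the failure mode to look for on thin structures and reflections.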

Circularity Check

0 steps flagged

No significant circularity; empirical architecture validated externally

Full rationale

The paper describes 3DTV as a feedforward neural network combining lightweight geometry with learning for sparse-view interpolation. It relies on architectural components (Delaunay triplet selection, pose-aware coarse-to-fine depth pyramid, feature reprojection, occlusion-aware blending) whose validity is asserted via experiments on multi-view video datasets and comparisons to real-time baselines. No equations, derivations, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citations. The method is explicitly positioned as running without scene-specific optimization or retraining, with claims resting on external empirical benchmarks rather than internal tautologies. This matches the default case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard computer-vision assumptions about depth estimation accuracy and feature reprojection validity; network weights are learned parameters.

free parameters (1)
  • neural network weights
    Learned during training on multi-view video datasets to enable the interpolation and blending behavior.
axioms (2)
  • domain assumption: Delaunay triangulation ensures adequate angular coverage for each target view
    Invoked to justify triplet selection in the abstract.
  • domain assumption: Coarse-to-fine depth estimation supports efficient and accurate feature reprojection and occlusion handling
    Basis for the pose-aware depth module described in the abstract.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 1 internal anchor

  [1] Boyao Zhou, Shunyuan Zheng, Hanzhang Tu, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian+: Generalizable pixel-wise 3d gaussian splatting for real-time human-scene rendering from sparse views. arXiv preprint arXiv:2411.11363, 2024.
  [2] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  [3] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.
  [4] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
  [5] Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, and Amirhossein Habibian. Valid: Variable-length input diffusion for novel view synthesis, 2023.
  [6] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.
  [7] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’93), pages 279–288. ACM, 1993.
  [8] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, 2023.
  [9] Domenic Zingsheim, Markus Plack, Hannah Dröge, Janelle Pfeifer, Patrick Stotko, Matthias Hullin, and Reinhard Klein. Riftcast: A template-free end-to-end multi-view live telepresence framework and benchmark. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
  [10] Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022.
  [11] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [12] Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, and Yu-Lun Liu. Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors. CVPR, 2025.
  [13] Dhruv Mahajan, Fu-Chung Huang, Wojciech Matusik, Ravi Ramamoorthi, and Peter Belhumeur. Moving gradients: a path-based method for plausible image interpolation. ACM Transactions on Graphics (TOG), 28(3):1–11, 2009.
  [14] Rida Sadek, Coloma Ballester, Luis Garrido, Enric Meinhardt, and Vicent Caselles. Frame interpolation with occlusion detection using a time coherent segmentation. In International Conference on Computer Vision Theory and Applications, volume 2, pages 367–372. SCITEPRESS, 2012.
  [15] Lars Lau Rakêt, Lars Roholm, Andrés Bruhn, and Joachim Weickert. Motion compensated frame interpolation with a symmetric optical flow constraint. In International Symposium on Visual Computing, pages 447–457. Springer, 2012.
  [16] Simone Meyer, Oliver Wang, Henning Zimmer, Max Grosse, and Alexander Sorkine-Hornung. Phase-based frame interpolation for video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1410–1418, 2015.
  [17] Gucan Long, Laurent Kneip, Jose M Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In European Conference on Computer Vision, pages 434–450. Springer, 2016.
  [18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9000–9008, 2018.
  [19] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE international conference on computer vision, pages 261–270, 2017.
  [20] Haoxian Zhang, Yang Zhao, and Ronggang Wang. A flexible recurrent residual pyramid network for video frame interpolation. In European conference on computer vision, pages 474–491. Springer, 2020.
  [21] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In European conference on computer vision, pages 109–125. Springer, 2020.
  [22] Dawit Mureja Argaw and In So Kweon. Long-term video frame interpolation via feature propagation, 2022.
  [23] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion, 2022.
  [24] Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia. Video frame interpolation with transformer, 2022.
  [25] Pengfei Han, Fuhua Zhang, Bin Zhao, and Xuelong Li. Motion-aware video frame interpolation, 2024.
  [26] Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim-vfi: directional motion field-guided frame interpolation for video with non-uniform motions, 2024.
  [27] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. Phasenet for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 498–507, 2018.
  [28] Yang Hai, Guo Wang, Tan Su, Wenjie Jiang, and Yinlin Hu. Hierarchical flow diffusion for efficient frame interpolation, 2025.
  [29] Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
  [30] Victor Fonte Chavez, Claudia Esteves, and Jean-Bernard Hayet. Time-adaptive video frame interpolation based on residual diffusion, 2025.
  [31] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3703–3712, 2019.
  [32] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal of computer vision, 38(3):199–218, 2000.
  [33] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009.
  [34] Neill DF Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European conference on computer vision, pages 766–779. Springer, 2008.
  [35] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications, 23(5):903–920, 2012.
  [36] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2018.
  [37] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. Computer Vision and Pattern Recognition (CVPR), 2019.
  [38] Xiaodong Gu, Zhiwen Fan, Zuozhuo Dai, Siyu Zhu, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching, 2020.
  [39] Jiayu Yang, Wei Mao, Jose M Alvarez, and Miaomiao Liu. Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4877–4886, 2020.
  [40] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2524–2534, 2020.
  [41] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  [42] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  [43] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022.
  [44] Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context-aware multi-view stereo network with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8585–8594, 2022.
  [45] Jinli Liao, Yikang Ding, Yoli Shavit, Dihe Huang, Shihao Ren, Jia Guo, Wensen Feng, and Kai Zhang. Wt-mvsnet: Window-based transformers for multi-view stereo, 2022.
  [46] Jie Zhu, Bo Peng, Wanqing Li, Haifeng Shen, Zhe Zhang, and Jianjun Lei. Multi-view stereo with transformer, 2021.
  [47] Xiaofeng Wang, Zheng Zhu, Fangbo Qin, Yun Ye, Guan Huang, Xu Chi, Yijia He, and Xingang Wang. Mvster: Epipolar transformer for efficient multi-view stereo, 2022.
  [48] Sicheng Wang, Hao Jiang, and Lei Xiang. Ct-mvsnet: Efficient multi-view stereo with cross-scale transformer, 2024.
  [49] Shaoqian Wang, Xiaokun Ding, Yuxin Mao, and Yuchao Dai. Etv-mvs: Robust visibility-aware multi-view stereo with epipolar line-based transformer. Big Data Mining and Analytics, 8(3):520–533, 2025.
  [50] Di Chang, Aljaž Božič, Tong Zhang, Qingsong Yan, Yingcong Chen, Sabine Süsstrunk, and Matthias Nießner. Rc-mvsnet: Unsupervised multi-view stereo with neural rendering. In Proceedings of the European conference on computer vision (ECCV), 2022.
  [51] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4160–4169, 2023.
  [52] Xingyu Chen, Qi Zhang, Xiaoyu Li, Yue Chen, Ying Feng, Xuan Wang, and Jue Wang. Hallucinated neural radiance fields in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12952, 2022.
  [53] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022.
  [54] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021.
  [55] Youngkyoon Jang and Eduardo Pérez-Pellitero. Comapgs: Covisibility map-based gaussian splatting for sparse novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
  [56] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20775–20785, 2024.
  [57] Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, and Nima Khademi Kalantari. Coherentgs: Sparse novel view synthesis with coherent 3d gaussians. In European Conference on Computer Vision, pages 19–37. Springer, 2024.
  [58] Shen Chen, Jiale Zhou, and Lei Li. Dense point clouds matter: Dust-gs for scene reconstruction from sparse viewpoints. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
  [59] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2(3):4, 2024.

  60. [60]

    Splatter image: Ultra- fast single-view 3d reconstruction.Conference on Computer Vision and Pattern Recognition (CVPR), 2024

    Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra- fast single-view 3d reconstruction.Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  61. [61]

    Speedy- splat: Fast 3d gaussian splatting with sparse pix- els and sparse primitives

    Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, and Tom Goldstein. Speedy- splat: Fast 3d gaussian splatting with sparse pix- els and sparse primitives. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 21537–21546, 2025

  62. [62]

    Compgs: Smaller and faster gaussian splatting with vector quantization

    KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, and Hamed Pir- siavash. Compgs: Smaller and faster gaussian splatting with vector quantization. InEuropean Conference on Computer Vision, pages 330–349. Springer, 2024. 14 3DTV: A Feedforward Interpolation Network S.Schulz et al

  63. [63]

    Denoising diffusion probabilistic models.Ad- vances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Ad- vances in neural information processing systems, 33:6840–6851, 2020

  64. [64]

    Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

  65. [65]

    Zero-shot novel view and depth synthesis with multi-view geometric diffusion, 2025

    Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, and Rares Ambrus. Zero-shot novel view and depth synthesis with multi-view geometric diffusion, 2025

  66. [66]

    Bolt3d: Generating 3d scenes in seconds,

    Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, and Philipp Henzler. Bolt3D: Generating 3D Scenes in Seconds. arXiv:2503.14445, 2025

  67. [67]

    Novel view synthesis with diffusion models

    Daniel Watson, William Chan, Ricardo Martin- Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view syn- thesis with diffusion models.arXiv preprint arXiv:2210.04628, 2022

  68. [68]

    Novel view synthesis with pixel-space diffusion models.arXiv preprint arXiv:2411.07765, 2024

    Noam Elata, Bahjat Kawar, Yaron Ostrovsky- Berman, Miriam Farber, and Ron Sokolovsky. Novel view synthesis with pixel-space diffusion models.arXiv preprint arXiv:2411.07765, 2024

  69. [69]

    pixelnerf: Neural radiance fields from one or few images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021

  70. [70]

    Ibrnet: Learning multi-view image-based rendering

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021

  71. [71]

    Fwd: Real-time novel view synthesis with for- ward warping and depth, 2022

    Ang Cao, Chris Rockwell, and Justin Johnson. Fwd: Real-time novel view synthesis with for- ward warping and depth, 2022

  72. [72]

    Fast and explicit neural view synthesis

    Pengsheng Guo, Miguel Angel Bautista, Alex Colburn, Liang Yang, Daniel Ulbricht, Joshua M Susskind, and Qi Shan. Fast and explicit neural view synthesis. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3791–3800, 2022

  73. [73]

    Snap-snap: Taking two im- ages to reconstruct 3d human gaussians in mil- liseconds, 2025

    Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, and Xinggang Wang. Snap-snap: Taking two im- ages to reconstruct 3d human gaussians in mil- liseconds, 2025

  74. [74]

    Fast, mini- mum storage ray/triangle intersection

    Tomas M¨ oller and Ben Trumbore. Fast, mini- mum storage ray/triangle intersection. InACM SIGGRAPH 2005 Courses, SIGGRAPH ’05, page 7–es, New York, NY, USA, 2005. Associ- ation for Computing Machinery

  75. [75]

    Ghost- net: More features from cheap operations

    Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghost- net: More features from cheap operations. In 2020 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 1577–1586, 2020

  76. [76]

    Ghostnetv2: enhance cheap operation with long-range atten- tion

    Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, and Yunhe Wang. Ghostnetv2: enhance cheap operation with long-range atten- tion. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc

  77. [77]

    Le, and Hartwig Adam

    Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vi- jay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3, 2019

  78. [78]

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convo- lutional nets and fully connected crfs, 2016

  79. [79]

    Group-wise cor- relation stereo network

    Xiaoyang Guo, Kai Yang, Wukui Yang, Xiao- gang Wang, and Hongsheng Li. Group-wise cor- relation stereo network. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3273–3282, 2019

  80. [80]

    Perceptual losses for real-time style transfer and super-resolution, 2016

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016
