3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Fernando Edelstein; Hannah Dröge; Markus Plack; Matthias B. Hullin; Stefan Schulz

arXiv: 2604.11211 · v1 · submitted 2026-04-13 · cs.CV · cs.LG · cs.MM

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification cs.CV · cs.LG · cs.MM

keywords real-time view synthesis · feedforward network · novel view interpolation · sparse multi-view video · depth estimation · occlusion-aware blending · AR/VR rendering

The pith

A feedforward network called 3DTV performs real-time novel view synthesis from sparse multi-view video without per-scene optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces 3DTV as a network that selects three input cameras for any target viewpoint using Delaunay triangulation to maintain good angular coverage. It then runs a pose-aware depth module that builds a coarse-to-fine pyramid of depth maps, allowing efficient reprojection of image features and blending that respects occlusions. Because the entire pipeline runs in a single forward pass, the trained network produces interpolated views at interactive rates and works on new scenes without further adjustment. A sympathetic reader would care because earlier high-quality view synthesis often required slow per-scene optimization, limiting its use in live AR, VR, or telepresence settings. The experiments on multi-view video datasets show that this design maintains competitive image quality while running at interactive speeds and without relying on explicit 3D scene models.
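As a rough illustration of the triplet-selection idea, the sketch below triangulates a handful of camera positions and returns the triangle that contains a target viewpoint. It is not the authors' implementation. It assumes the cameras have already been mapped to 2D coordinates (for example, angles on a capture dome), it uses a brute-force empty-circumcircle test that only suits small rigs, and the names `delaunay_triangles` and `select_triplet` are hypothetical.

```python
from itertools import combinations

EPS = 1e-12  # tolerance for degenerate (collinear / co-circular) cases


def _orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of CCW triangle abc."""
    (ax, ay), (bx, by), (cx, cy) = [(q[0] - p[0], q[1] - p[1]) for q in (a, b, c)]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > EPS


def delaunay_triangles(points):
    """Brute-force Delaunay: keep triples whose circumcircle contains no other point."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        area2 = _orient(a, b, c)
        if abs(area2) < EPS:
            continue  # collinear cameras span no triangle
        if area2 < 0:
            b, c = c, b  # the in-circle test expects counter-clockwise order
        others = (points[m] for m in range(len(points)) if m not in (i, j, k))
        if not any(_in_circumcircle(a, b, c, q) for q in others):
            triangles.append((i, j, k))
    return triangles


def select_triplet(points, target):
    """Return (camera indices, barycentric weights) of the triangle containing target."""
    for i, j, k in delaunay_triangles(points):
        a, b, c = points[i], points[j], points[k]
        area2 = _orient(a, b, c)
        weights = (_orient(target, b, c) / area2,
                   _orient(a, target, c) / area2,
                   _orient(a, b, target) / area2)
        if min(weights) >= -1e-9:  # inside or on the boundary
            return (i, j, k), weights
    return None  # target lies outside the convex hull of the cameras


# Hypothetical rig: four corner cameras plus one in the centre.
cameras = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(select_triplet(cameras, (0.5, 0.2)))  # triangle (0, 1, 4), weights near (0.3, 0.3, 0.4)
```

The `None` return for targets outside the convex hull is the case the load-bearing premise worries about: a triangulation cannot guarantee coverage for viewpoints the rig does not surround. A production system would more likely rely on a library routine such as `scipy.spatial.Delaunay` than on this O(n^4) loop.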

Core claim

3DTV is a feedforward interpolation network for real-time sparse-view synthesis. A Delaunay-based triplet selection ensures angular coverage for each target view. A pose-aware depth module estimates a coarse-to-fine depth pyramid that supports efficient feature reprojection and occlusion-aware blending. The network runs entirely feedforward without retraining or explicit proxies and achieves a practical balance of quality and speed on challenging multi-view video datasets.
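To make the coarse-to-fine idea concrete, here is a minimal single-pixel sketch: at each level a few depth hypotheses are scored, and the next level searches a narrower range around the best one. This is a classical stand-in, not the paper's learned depth module. The function name `coarse_to_fine_depth`, the scalar `cost` callback, and the parameters `levels` and `n_hyp` are all assumptions made for illustration.

```python
def coarse_to_fine_depth(cost, d_min, d_max, levels=4, n_hyp=8):
    """Refine one depth value by repeatedly sampling a shrinking depth range.

    cost:  callable mapping a depth hypothesis to a matching cost (lower is better).
           In a real system this would come from feature correlation; here it is
           an arbitrary function supplied by the caller.
    """
    if n_hyp < 2 or levels < 1 or d_max <= d_min:
        raise ValueError("need n_hyp >= 2, levels >= 1 and d_max > d_min")
    lo, hi = d_min, d_max
    best = 0.5 * (lo + hi)
    for _ in range(levels):
        step = (hi - lo) / (n_hyp - 1)
        hypotheses = [lo + i * step for i in range(n_hyp)]
        best = min(hypotheses, key=cost)
        # The next level searches one step on either side of the current best.
        lo = max(d_min, best - step)
        hi = min(d_max, best + step)
    return best


# Toy cost with a single minimum at a depth of 2.37 (arbitrary units).
estimate = coarse_to_fine_depth(lambda d: (d - 2.37) ** 2, 0.5, 10.0)
print(round(estimate, 3))
```

With the defaults above the search evaluates 32 hypotheses, whereas a uniform sweep of the same range at the same final step size would need roughly 300. That saving is the efficiency argument behind pyramids in general. The sketch assumes a cost with one clear minimum, which is exactly what textureless or reflective regions tend to violate.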

What carries the argument

The combination of Delaunay triangulation for selecting input camera triplets and a pose-aware depth module that produces a multi-scale depth pyramid for feature warping and blending.
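The warping-and-blending step can be sketched with a standard pinhole model and a simple depth test. The code below is an assumption-laden illustration, not the network's learned blending: `reproject` and `blend_occlusion_aware` are hypothetical names, intrinsics are given as `(fx, fy, cx, cy)`, the pose convention is `X_tgt = R @ X_src + t`, and occlusion is decided by a fixed relative depth tolerance `rel_tol`.

```python
def reproject(u, v, depth, k_src, k_tgt, rotation, translation):
    """Map a source pixel with known depth to (u', v', depth') in the target view."""
    fx, fy, cx, cy = k_src
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform into the target camera frame.
    moved = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
             for i in range(3)]
    if moved[2] <= 0:
        return None  # point is behind the target camera
    fx, fy, cx, cy = k_tgt
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy, moved[2])


def blend_occlusion_aware(candidates, rel_tol=0.05):
    """Blend colours that land on one target pixel, ignoring occluded ones.

    candidates: list of (rgb, target_depth, weight), one per source view.
    A candidate counts as visible if its depth is within rel_tol of the nearest.
    """
    if not candidates:
        return None
    nearest = min(depth for _, depth, _ in candidates)
    visible = [(rgb, w) for rgb, depth, w in candidates
               if depth <= nearest * (1.0 + rel_tol)]
    total = sum(w for _, w in visible)
    if total <= 0:
        return None
    return tuple(sum(w * rgb[c] for rgb, w in visible) / total for c in range(3))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)
# A pixel at the principal point, 2 units away, seen from a camera shifted sideways.
print(reproject(320.0, 240.0, 2.0, intrinsics, intrinsics, identity, [0.1, 0.0, 0.0]))

# Two views agree on depth; the third sees a surface further back and is dropped.
views = [((1.0, 0.0, 0.0), 2.00, 0.5),
         ((0.0, 1.0, 0.0), 2.02, 0.3),
         ((0.0, 0.0, 1.0), 3.50, 0.2)]
print(blend_occlusion_aware(views))
```

The hard threshold makes the dependence on depth accuracy easy to see: a depth error larger than `rel_tol` either discards a valid view or lets an occluded surface bleed through. That is the failure mode the premise and the referee's second major comment both point at.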

If this is right

Real-time free-viewpoint rendering becomes feasible for interactive AR, VR, and telepresence without offline optimization.
The system produces robust results across diverse scenes because it avoids reliance on explicit geometric proxies.
Low-latency multi-view streaming and interactive rendering become practical on standard hardware.
Quality and efficiency trade-offs improve over prior real-time novel-view baselines on multi-view video data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same triplet-plus-depth-pyramid structure might extend to dynamic scenes if the depth module is replaced by a temporal version.
Integration with existing video compression pipelines could further reduce bandwidth for streaming interpolated viewpoints.
The feedforward design suggests that lightweight geometric priors can replace heavy optimization in other view-synthesis tasks.

Load-bearing premise

The Delaunay triplet selection always supplies enough angular coverage and the estimated depth maps are accurate enough for occlusion handling without any scene-specific tuning.

What would settle it

Running the method on a scene whose camera layout yields poor triangulation coverage or where depth estimation fails on thin structures or reflections, then measuring whether output quality drops below that of the compared real-time baselines.
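One common way to quantify "output quality" in such a test is PSNR, a metric the referee report below also names. The sketch is a generic definition, not the paper's evaluation code; the function name `psnr` and the flat-list image representation are assumptions chosen to keep the example dependency-free.

```python
import math


def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel values in [0, max_val].
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale corresponds to about 20 dB.
print(psnr([0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]))
```

Comparing this number for 3DTV and a baseline on the same hard scene (poor triangulation coverage, thin structures, reflections) would give the drop the test asks about. PSNR alone is not a full picture, so a structural or perceptual metric alongside it would make the comparison more convincing.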

Figures

Figures reproduced from arXiv: 2604.11211 by Fernando Edelstein, Hannah Dröge, Markus Plack, Matthias B. Hullin, Stefan Schulz.

**Figure 2.** Overview of our real-time view interpolation framework. A lightweight Ghost-based backbone [...]

**Figure 3.** Details on the Correlation Block (left) and Proj Block (right) from Fig. 2. The correlation [...]

**Figure 4.** Qualitative comparison on multiple datasets for human capture and non-human containing scenes [...]

**Figure 5.** Generalization behavior of 3DTV across geometry, viewpoint, and resolution. Despite synthetic [...]

**Figure 6.** Exemplary Delaunay triangulation for the RIFTCast [9] capture stage setup. The scene has been [...]

**Figure 7.** Evaluation of hyperparameter choices and their influence on the resulting triangulation.

**Figure 8.** (Cont.) Evaluation of hyperparameter choices and their influence on the resulting triangulation.

**Figure 9.** Snap-Snap produces significant ghosting and duplication artifacts when rendering the frontal view from the RIFTCast [9] dataset [...]

**Figure 11.** Failure analysis of FWD: The model fails [...]

**Figure 12.** Qualitative comparison of our method to online and offline reconstruction methods. (Part 1.1)

**Figure 13.** Qualitative comparison of our method to online and offline reconstruction methods. (Part 1.2)

**Figure 14.** Qualitative comparison of our method to online and offline reconstruction methods. (Part 2.1)

**Figure 15.** Qualitative comparison of our method to online and offline reconstruction methods. (Part 2.2)

**Figure 16.** Qualitative evaluation (Part 1) of our model and the different influences of individual parts to [...]

**Figure 17.** Qualitative evaluation (Part 2) of our model and the different influences of individual parts to [...]

Original abstract

Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

3DTV combines Delaunay triplet selection with a pose-aware depth pyramid in a feedforward network for real-time sparse-view synthesis without per-scene training or explicit proxies.

The key takeaway is that 3DTV is a feedforward network for real-time sparse-view interpolation that uses Delaunay-based triplet selection and a pose-aware depth pyramid to enable occlusion-aware blending without any scene-specific training or explicit proxies. What is new here is the integration of these elements into a single pipeline optimized for low latency.

The Delaunay step selects views with good angular coverage, and the depth module creates a coarse-to-fine pyramid to support accurate feature reprojection and blending. This setup is presented as practical for AR/VR and telepresence, where per-scene optimization would be too slow. The paper does a good job highlighting the balance between quality and efficiency in multi-view video settings. Avoiding explicit proxies is a nice practical feature if the depth estimates hold up across diverse scenes.

The main soft spot is that the abstract asserts consistent outperformance over recent real-time baselines but provides no numbers, error analysis, or even basic dataset information. This makes it hard to verify the claims without the full results. The stress-test point about the depth module needing to be precise enough in tricky areas like textureless regions or occlusions is relevant; any weaknesses there would directly affect the no-proxy robustness. I would want to see if the experiments include such cases or just average performance.

This paper is for readers interested in applied novel view synthesis, particularly those building systems that need to run in real time without heavy computation per scene. It could be valuable for someone prototyping interactive rendering pipelines. I would recommend sending it to peer review. The approach is clear and the problem it targets is important, even if more evidence on the quantitative side is needed.

Referee Report

2 major / 1 minor

Summary. The paper introduces 3DTV, a feedforward interpolation network for real-time view synthesis from sparse multi-view inputs. It employs Delaunay-based triplet selection to ensure angular coverage and a pose-aware depth module with a coarse-to-fine pyramid for efficient feature reprojection and occlusion-aware blending. The approach is designed to run without scene-specific optimization or retraining, and the authors report that it achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view synthesis baselines on multi-view video datasets while avoiding explicit proxies for robust rendering across diverse scenes.

Significance. If the experimental results are substantiated, this work could have significant impact on practical applications in AR/VR, telepresence, and interactive rendering by providing a lightweight, feedforward alternative to optimization-heavy methods. The combination of geometric priors (Delaunay selection) with learned components (depth estimation) without requiring per-scene tuning is a promising direction for generalization. However, the current presentation lacks sufficient quantitative evidence to fully evaluate the claims.

major comments (2)

[Abstract and Experiments] The abstract claims that 3DTV 'consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines' and 'enabling robust rendering across diverse scenes,' but provides no quantitative metrics (e.g., PSNR, SSIM, runtime), specific baselines, dataset details, or error analysis. This makes it impossible to verify the outperformance and robustness claims without the full results section.
[Pose-aware depth module] The feedforward claim and robust rendering across diverse scenes rest on the pose-aware depth module (coarse-to-fine pyramid) producing depth maps sufficiently precise for feature reprojection and occlusion-aware blending without scene-specific optimization or explicit proxies. The manuscript should include ablation studies or depth accuracy metrics to demonstrate that depth errors do not lead to visible artifacts in textureless or occluded regions.

minor comments (1)

[Notation] Ensure consistent use of terms like 'pose-aware depth module' and 'coarse-to-fine depth pyramid' throughout the paper to avoid confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better substantiate the claims in the abstract and provide additional evidence for the depth module. We address each point below and have revised the manuscript to incorporate the suggestions where feasible.

Referee: [Abstract and Experiments] The abstract claims that 3DTV 'consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines' and 'enabling robust rendering across diverse scenes,' but provides no quantitative metrics (e.g., PSNR, SSIM, runtime), specific baselines, dataset details, or error analysis. This makes it impossible to verify the outperformance and robustness claims without the full results section.

Authors: We agree that the abstract would benefit from additional context to support the summarized claims. In the revised manuscript, we have updated the abstract to include brief references to key quantitative results (e.g., average PSNR/SSIM gains and runtime on the evaluated multi-view video datasets) and the main baselines compared. The full experimental details, including specific dataset descriptions, error analysis, and comparisons to recent real-time NVS baselines, remain in Section 4. This change improves accessibility without lengthening the abstract excessively.

revision: yes

Referee: [Pose-aware depth module] The feedforward claim and robust rendering across diverse scenes rest on the pose-aware depth module (coarse-to-fine pyramid) producing depth maps sufficiently precise for feature reprojection and occlusion-aware blending without scene-specific optimization or explicit proxies. The manuscript should include ablation studies or depth accuracy metrics to demonstrate that depth errors do not lead to visible artifacts in textureless or occluded regions.

Authors: The pose-aware depth module is evaluated through its contribution to end-to-end rendering quality across diverse scenes, as shown in our experiments without per-scene optimization. We acknowledge the value of targeted ablations. In the revised version, we have added an ablation study in Section 4.3 comparing the coarse-to-fine pyramid against a single-scale depth estimator, with qualitative examples and rendering metrics demonstrating reduced artifacts in occluded and textureless regions. Standalone depth accuracy metrics (e.g., absolute depth error) are not reported because ground-truth depth is unavailable for the primary video datasets; our evaluation prioritizes perceptual rendering quality, which indirectly validates the depth module's precision for reprojection and blending.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical architecture validated externally

The paper describes 3DTV as a feedforward neural network combining lightweight geometry with learning for sparse-view interpolation. It relies on architectural components (Delaunay triplet selection, pose-aware coarse-to-fine depth pyramid, feature reprojection, occlusion-aware blending) whose validity is asserted via experiments on multi-view video datasets and comparisons to real-time baselines. No equations, derivations, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citations. The method is explicitly positioned as running without scene-specific optimization or retraining, with claims resting on external empirical benchmarks rather than internal tautologies. This matches the default case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard computer-vision assumptions about depth estimation accuracy and feature reprojection validity; network weights are learned parameters.

free parameters (1)

neural network weights
Learned during training on multi-view video datasets to enable the interpolation and blending behavior.

axioms (2)

domain assumption: Delaunay triangulation ensures adequate angular coverage for each target view
Invoked to justify triplet selection in the abstract.
domain assumption: Coarse-to-fine depth estimation supports efficient and accurate feature reprojection and occlusion handling
Basis for the pose-aware depth module described in the abstract.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 1 internal anchor

[1] Boyao Zhou, Shunyuan Zheng, Hanzhang Tu, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian+: Generalizable pixel-wise 3d gaussian splatting for real-time human-scene rendering from sparse views. arXiv preprint arXiv:2411.11363, 2024.

[2] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[3] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.

[4] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.

[5] Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, and Amirhossein Habibian. Valid: Variable-length input diffusion for novel view synthesis, 2023.

[6] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.

[7] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’93), pages 279–288. ACM, 1993.

[8] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, 2023.

[9] Domenic Zingsheim, Markus Plack, Hannah Dröge, Janelle Pfeifer, Patrick Stotko, Matthias Hullin, and Reinhard Klein. Riftcast: A template-free end-to-end multi-view live telepresence framework and benchmark. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
[10] Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022.

[11] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[12] Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, and Yu-Lun Liu. Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors. CVPR, 2025.

[13] Dhruv Mahajan, Fu-Chung Huang, Wojciech Matusik, Ravi Ramamoorthi, and Peter Belhumeur. Moving gradients: a path-based method for plausible image interpolation. ACM Transactions on Graphics (TOG), 28(3):1–11, 2009.

[14] Rida Sadek, Coloma Ballester, Luis Garrido, Enric Meinhardt, and Vicent Caselles. Frame interpolation with occlusion detection using a time coherent segmentation. In International Conference on Computer Vision Theory and Applications, volume 2, pages 367–372. SCITEPRESS, 2012.

[15] Lars Lau Rakêt, Lars Roholm, Andrés Bruhn, and Joachim Weickert. Motion compensated frame interpolation with a symmetric optical flow constraint. In International Symposium on Visual Computing, pages 447–457. Springer, 2012.

[16] Simone Meyer, Oliver Wang, Henning Zimmer, Max Grosse, and Alexander Sorkine-Hornung. Phase-based frame interpolation for video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1410–1418, 2015.

[17] Gucan Long, Laurent Kneip, Jose M Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In European Conference on Computer Vision, pages 434–450. Springer, 2016.

[18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9000–9008, 2018.
[19] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE international conference on computer vision, pages 261–270, 2017.

[20] Haoxian Zhang, Yang Zhao, and Ronggang Wang. A flexible recurrent residual pyramid network for video frame interpolation. In European conference on computer vision, pages 474–491. Springer, 2020.

[21] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In European conference on computer vision, pages 109–125. Springer, 2020.

[22] Dawit Mureja Argaw and In So Kweon. Long-term video frame interpolation via feature propagation, 2022.

[23] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion, 2022.

[24] Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia. Video frame interpolation with transformer, 2022.

[25] Pengfei Han, Fuhua Zhang, Bin Zhao, and Xuelong Li. Motion-aware video frame interpolation, 2024.

[26] Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim-vfi: directional motion field-guided frame interpolation for video with non-uniform motions, 2024.

[27] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. Phasenet for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 498–507, 2018.
[28] Yang Hai, Guo Wang, Tan Su, Wenjie Jiang, and Yinlin Hu. Hierarchical flow diffusion for efficient frame interpolation, 2025.

[29] Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[30] Victor Fonte Chavez, Claudia Esteves, and Jean-Bernard Hayet. Time-adaptive video frame interpolation based on residual diffusion, 2025.

[31] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3703–3712, 2019.

[32] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal of computer vision, 38(3):199–218, 2000.

[33] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009.

[34] Neill DF Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European conference on computer vision, pages 766–779. Springer, 2008.

[35] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications, 23(5):903–920, 2012.

[36] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2018.
[37] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. Computer Vision and Pattern Recognition (CVPR), 2019.

[38] Xiaodong Gu, Zhiwen Fan, Zuozhuo Dai, Siyu Zhu, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching, 2020.

[39] Jiayu Yang, Wei Mao, Jose M Alvarez, and Miaomiao Liu. Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4877–4886, 2020.

[40] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2524–2534, 2020.

[41] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.

[42] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[43] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022.

[44] Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context-aware multi-view stereo network with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8585–8594, 2022.

[45] Jinli Liao, Yikang Ding, Yoli Shavit, Dihe Huang, Shihao Ren, Jia Guo, Wensen Feng, and Kai Zhang. Wt-mvsnet: Window-based transformers for multi-view stereo, 2022.
[46]

Multi-view stereo with transformer, 2021

Jie Zhu, Bo Peng, Wanqing Li, Haifeng Shen, Zhe Zhang, and Jianjun Lei. Multi-view stereo with transformer, 2021. 13 3DTV: A Feedforward Interpolation Network S.Schulz et al

work page 2021
[47]

Mvster: Epipolar transformer for efficient multi-view stereo, 2022

Xiaofeng Wang, Zheng Zhu, Fangbo Qin, Yun Ye, Guan Huang, Xu Chi, Yijia He, and Xingang Wang. Mvster: Epipolar transformer for efficient multi-view stereo, 2022

[48] Sicheng Wang, Hao Jiang, and Lei Xiang. Ct-mvsnet: Efficient multi-view stereo with cross-scale transformer, 2024.

[49] Shaoqian Wang, Xiaokun Ding, Yuxin Mao, and Yuchao Dai. Etv-mvs: Robust visibility-aware multi-view stereo with epipolar line-based transformer. Big Data Mining and Analytics, 8(3):520–533, 2025.

[50] Di Chang, Aljaž Božič, Tong Zhang, Qingsong Yan, Yingcong Chen, Sabine Süsstrunk, and Matthias Nießner. Rc-mvsnet: Unsupervised multi-view stereo with neural rendering. In Proceedings of the European conference on computer vision (ECCV), 2022.

[51] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4160–4169, 2023.

[52] Xingyu Chen, Qi Zhang, Xiaoyu Li, Yue Chen, Ying Feng, Xuan Wang, and Jue Wang. Hallucinated neural radiance fields in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12952, 2022.

[53] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022.

[54] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021.

[55] Youngkyoon Jang and Eduardo Pérez-Pellitero. Comapgs: Covisibility map-based gaussian splatting for sparse novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[56] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20775–20785, 2024.

[57] Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, and Nima Khademi Kalantari. Coherentgs: Sparse novel view synthesis with coherent 3d gaussians. In European Conference on Computer Vision, pages 19–37. Springer, 2024.

[58] Shen Chen, Jiale Zhou, and Lei Li. Dense point clouds matter: Dust-gs for scene reconstruction from sparse viewpoints. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

[59] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2(3):4, 2024.

[60] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[61] Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, and Tom Goldstein. Speedy-splat: Fast 3d gaussian splatting with sparse pixels and sparse primitives. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21537–21546, 2025.

[62] KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, and Hamed Pirsiavash. Compgs: Smaller and faster gaussian splatting with vector quantization. In European Conference on Computer Vision, pages 330–349. Springer, 2024.

[63] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

[64] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.

[65] Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, and Rares Ambrus. Zero-shot novel view and depth synthesis with multi-view geometric diffusion, 2025.

[66] Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, and Philipp Henzler. Bolt3D: Generating 3D Scenes in Seconds. arXiv:2503.14445, 2025.

[67] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.

[68] Noam Elata, Bahjat Kawar, Yaron Ostrovsky-Berman, Miriam Farber, and Ron Sokolovsky. Novel view synthesis with pixel-space diffusion models. arXiv preprint arXiv:2411.07765, 2024.

[69] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021.

[70] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021.

[71] Ang Cao, Chris Rockwell, and Justin Johnson. Fwd: Real-time novel view synthesis with forward warping and depth, 2022.

[72] Pengsheng Guo, Miguel Angel Bautista, Alex Colburn, Liang Yang, Daniel Ulbricht, Joshua M Susskind, and Qi Shan. Fast and explicit neural view synthesis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3791–3800, 2022.

[73] Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, and Xinggang Wang. Snap-snap: Taking two images to reconstruct 3d human gaussians in milliseconds, 2025.

[74] Tomas Möller and Ben Trumbore. Fast, minimum storage ray/triangle intersection. In ACM SIGGRAPH 2005 Courses, SIGGRAPH ’05, page 7–es, New York, NY, USA, 2005. Association for Computing Machinery.

[75] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1577–1586, 2020.

[76] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, and Yunhe Wang. Ghostnetv2: enhance cheap operation with long-range attention. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.

[77] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3, 2019.

[78] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs, 2016.

[79] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3273–3282, 2019.

[80] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016.


[1] Boyao Zhou, Shunyuan Zheng, Hanzhang Tu, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian+: Generalizable pixel-wise 3d gaussian splatting for real-time human-scene rendering from sparse views. arXiv preprint arXiv:2411.11363, 2024.

[2] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[3] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.

[4] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.

[5] Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, and Amirhossein Habibian. Valid: Variable-length input diffusion for novel view synthesis, 2023.

[6] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.

[7] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’93), pages 279–288. ACM, 1993.

[8] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, 2023.

[9] Domenic Zingsheim, Markus Plack, Hannah Dröge, Janelle Pfeifer, Patrick Stotko, Matthias Hullin, and Reinhard Klein. Riftcast: A template-free end-to-end multi-view live telepresence framework and benchmark. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.

[10] Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022.

[11] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[12] Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, and Yu-Lun Liu. Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors. CVPR, 2025.

[13] Dhruv Mahajan, Fu-Chung Huang, Wojciech Matusik, Ravi Ramamoorthi, and Peter Belhumeur. Moving gradients: a path-based method for plausible image interpolation. ACM Transactions on Graphics (TOG), 28(3):1–11, 2009.

[14] Rida Sadek, Coloma Ballester, Luis Garrido, Enric Meinhardt, and Vicent Caselles. Frame interpolation with occlusion detection using a time coherent segmentation. In International Conference on Computer Vision Theory and Applications, volume 2, pages 367–372. SCITEPRESS, 2012.

[15] Lars Lau Rakêt, Lars Roholm, Andrés Bruhn, and Joachim Weickert. Motion compensated frame interpolation with a symmetric optical flow constraint. In International Symposium on Visual Computing, pages 447–457. Springer, 2012.

[16] Simone Meyer, Oliver Wang, Henning Zimmer, Max Grosse, and Alexander Sorkine-Hornung. Phase-based frame interpolation for video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1410–1418, 2015.

[17] Gucan Long, Laurent Kneip, Jose M Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In European Conference on Computer Vision, pages 434–450. Springer, 2016.

[18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9000–9008, 2018.

[19] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE international conference on computer vision, pages 261–270, 2017.

[20] Haoxian Zhang, Yang Zhao, and Ronggang Wang. A flexible recurrent residual pyramid network for video frame interpolation. In European conference on computer vision, pages 474–491. Springer, 2020.

[21] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In European conference on computer vision, pages 109–125. Springer, 2020.

[22] Dawit Mureja Argaw and In So Kweon. Long-term video frame interpolation via feature propagation, 2022.

[23] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion, 2022.

[24] Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia. Video frame interpolation with transformer, 2022.

[25] Pengfei Han, Fuhua Zhang, Bin Zhao, and Xuelong Li. Motion-aware video frame interpolation, 2024.

[26] Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim-vfi: directional motion field-guided frame interpolation for video with non-uniform motions, 2024.

[27] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. Phasenet for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 498–507, 2018.

[28] Yang Hai, Guo Wang, Tan Su, Wenjie Jiang, and Yinlin Hu. Hierarchical flow diffusion for efficient frame interpolation, 2025.

[29] Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[30] Victor Fonte Chavez, Claudia Esteves, and Jean-Bernard Hayet. Time-adaptive video frame interpolation based on residual diffusion, 2025.

[31] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3703–3712, 2019.

[32] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal of computer vision, 38(3):199–218, 2000.

[33] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009.

[34] Neill DF Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European conference on computer vision, pages 766–779. Springer, 2008.

[35] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications, 23(5):903–920, 2012.

[36] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2018.

[37] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. Computer Vision and Pattern Recognition (CVPR), 2019.

[38] Xiaodong Gu, Zhiwen Fan, Zuozhuo Dai, Siyu Zhu, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching, 2020.

[39] Jiayu Yang, Wei Mao, Jose M Alvarez, and Miaomiao Liu. Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4877–4886, 2020.

[40] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2524–2534, 2020.

[41] [41]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Syn- naeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

work page 2020

[42] [42]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexan- der Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[43] [43]

A survey on vision transformer.IEEE trans- actions on pattern analysis and machine intelli- gence, 45(1):87–110, 2022

Kai Han, Yunhe Wang, Hanting Chen, Xing- hao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer.IEEE trans- actions on pattern analysis and machine intelli- gence, 45(1):87–110, 2022

work page 2022

[44] [44]

Transmvsnet: Global context- aware multi-view stereo network with transform- ers

Yikang Ding, Wentao Yuan, Qingtian Zhu, Hao- tian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context- aware multi-view stereo network with transform- ers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8585–8594, 2022

work page 2022

[45] [45]

Wt-mvsnet: Window-based trans- formers for multi-view stereo, 2022

Jinli Liao, Yikang Ding, Yoli Shavit, Dihe Huang, Shihao Ren, Jia Guo, Wensen Feng, and Kai Zhang. Wt-mvsnet: Window-based trans- formers for multi-view stereo, 2022

work page 2022

[46] [46]

Multi-view stereo with transformer, 2021

Jie Zhu, Bo Peng, Wanqing Li, Haifeng Shen, Zhe Zhang, and Jianjun Lei. Multi-view stereo with transformer, 2021. 13 3DTV: A Feedforward Interpolation Network S.Schulz et al

work page 2021

[47] [47]

Mvster: Epipolar transformer for efficient multi-view stereo, 2022

Xiaofeng Wang, Zheng Zhu, Fangbo Qin, Yun Ye, Guan Huang, Xu Chi, Yijia He, and Xingang Wang. Mvster: Epipolar transformer for efficient multi-view stereo, 2022

work page 2022

[48] [48]

Ct- mvsnet: Efficient multi-view stereo with cross- scale transformer, 2024

Sicheng Wang, Hao Jiang, and Lei Xiang. Ct- mvsnet: Efficient multi-view stereo with cross- scale transformer, 2024

work page 2024

[49] [49]

Etv-mvs: Robust visibility- aware multi-view stereo with epipolar line-based transformer.Big Data Mining and Analytics, 8(3):520–533, 2025

Shaoqian Wang, Xiaokun Ding, Yuxin Mao, and Yuchao Dai. Etv-mvs: Robust visibility- aware multi-view stereo with epipolar line-based transformer.Big Data Mining and Analytics, 8(3):520–533, 2025

work page 2025

[50] [50]

Rc-mvsnet: Unsupervised multi-view stereo with neural rendering

Di Chang, Aljaˇ z Boˇ ziˇ c, Tong Zhang, Qingsong Yan, Yingcong Chen, Sabine S¨ usstrunk, and Matthias Nießner. Rc-mvsnet: Unsupervised multi-view stereo with neural rendering. InPro- ceedings of the European conference on computer vision (ECCV), 2022

work page 2022

[51] [51]

Nope-nerf: Optimising neural radiance field with no pose prior

Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recogni- tion, pages 4160–4169, 2023

work page 2023

[52] [52]

Halluci- nated neural radiance fields in the wild

Xingyu Chen, Qi Zhang, Xiaoyu Li, Yue Chen, Ying Feng, Xuan Wang, and Jue Wang. Halluci- nated neural radiance fields in the wild. InPro- ceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 12943–12952, 2022

work page 2022

[53] [53]

Plenoxels: Radiance fields with- out neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields with- out neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022

work page 2022

[54] [54]

Nerf in the wild: Neural radiance fields for unconstrained photo collections

Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021

work page 2021

[55] [55]

Comapgs: Covisibility map-based gaussian splatting for sparse novel view synthesis

Youngkyoon Jang and Eduardo P´ erez-Pellitero. Comapgs: Covisibility map-based gaussian splatting for sparse novel view synthesis. InPro- ceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025

work page 2025

[56] [56]

Dngaus- sian: Optimizing sparse-view 3d gaussian ra- diance fields with global-local depth normaliza- tion

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaus- sian: Optimizing sparse-view 3d gaussian ra- diance fields with global-local depth normaliza- tion. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 20775–20785, 2024

work page 2024

[57] [57]

Coherentgs: Sparse novel view synthesis with coherent 3d gaussians

Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, and Nima Khademi Kalantari. Coherentgs: Sparse novel view synthesis with coherent 3d gaussians. InEuropean Conference on Computer Vision, pages 19–37. Springer, 2024

work page 2024

[58] [58]

Dense point clouds matter: Dust-gs for scene reconstruc- tion from sparse viewpoints

Shen Chen, Jiale Zhou, and Lei Li. Dense point clouds matter: Dust-gs for scene reconstruc- tion from sparse viewpoints. InICASSP 2025- 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025

[59] [59]

InstantSplat: Sparse-view gaussian splatting in seconds.arXiv preprint arXiv:2403.20309, 2024

Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse- view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2(3):4, 2024

work page arXiv 2024

[60] [60]

Splatter image: Ultra- fast single-view 3d reconstruction.Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra- fast single-view 3d reconstruction.Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[61] [61]

Speedy- splat: Fast 3d gaussian splatting with sparse pix- els and sparse primitives

Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, and Tom Goldstein. Speedy- splat: Fast 3d gaussian splatting with sparse pix- els and sparse primitives. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 21537–21546, 2025

work page 2025

[62] [62]

Compgs: Smaller and faster gaussian splatting with vector quantization

KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, and Hamed Pir- siavash. Compgs: Smaller and faster gaussian splatting with vector quantization. InEuropean Conference on Computer Vision, pages 330–349. Springer, 2024. 14 3DTV: A Feedforward Interpolation Network S.Schulz et al

work page 2024

[63] [63]

Denoising diffusion probabilistic models.Ad- vances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Ad- vances in neural information processing systems, 33:6840–6851, 2020

work page 2020

[64] [64]

Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

work page 2022

[65] [65]

Zero-shot novel view and depth synthesis with multi-view geometric diffusion, 2025

Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, and Rares Ambrus. Zero-shot novel view and depth synthesis with multi-view geometric diffusion, 2025

work page 2025

[66] Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, and Philipp Henzler. Bolt3D: Generating 3D Scenes in Seconds. arXiv:2503.14445, 2025.

[67] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.

[68] Noam Elata, Bahjat Kawar, Yaron Ostrovsky-Berman, Miriam Farber, and Ron Sokolovsky. Novel view synthesis with pixel-space diffusion models. arXiv preprint arXiv:2411.07765, 2024.

[69] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021.

[70] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021.

[71] Ang Cao, Chris Rockwell, and Justin Johnson. Fwd: Real-time novel view synthesis with forward warping and depth, 2022.

[72] Pengsheng Guo, Miguel Angel Bautista, Alex Colburn, Liang Yang, Daniel Ulbricht, Joshua M Susskind, and Qi Shan. Fast and explicit neural view synthesis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3791–3800, 2022.

[73] Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, and Xinggang Wang. Snap-snap: Taking two images to reconstruct 3d human gaussians in milliseconds, 2025.

[74] Tomas Möller and Ben Trumbore. Fast, minimum storage ray/triangle intersection. In ACM SIGGRAPH 2005 Courses, SIGGRAPH ’05, page 7–es, New York, NY, USA, 2005. Association for Computing Machinery.

[75] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1577–1586, 2020.

[76] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, and Yunhe Wang. Ghostnetv2: enhance cheap operation with long-range attention. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.

[77] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3, 2019.

[78] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs, 2016.

[79] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3273–3282, 2019.

[80] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016.