pith. machine review for the scientific record.

arxiv: 2511.20853 · v3 · submitted 2025-11-25 · 💻 cs.CV · cs.AI · cs.LG · eess.IV

Recognition: 2 theorem links · Lean Theorem

MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 04:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · eess.IV
keywords stereo dataset · depth estimation · depth of field · DSLR · optical effects · camera calibration · real-world data · computer vision

The pith

The authors introduce the first high-resolution stereo DSLR dataset with 18000 images that systematically varies focal length and aperture across real scenes to capture professional camera optics for depth tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that depth estimation and related vision tasks suffer from poor generalization because existing datasets lack the optical complexity of real professional cameras. It addresses this by releasing MODEST, a collection of 18000 high-resolution stereo images captured with two identical DSLR assemblies. The images cover nine scenes at ten focal lengths and five apertures, yielding fifty distinct optical configurations per scene along with dedicated calibration sets. A sympathetic reader would care because the dataset supplies controlled real data for studying how focus and aperture changes affect monocular depth, stereo matching, deblurring, and 3D reconstruction. If the claim holds, researchers gain a testbed that more closely matches actual camera behavior than synthetic alternatives.

Core claim

The central claim is that MODEST supplies the first large-scale, high-resolution (5472 by 3648 pixels) stereo DSLR dataset in which focal length and aperture are varied systematically across complex real scenes. For each of nine scenes the authors record 2000 images using two matched camera rigs at focal lengths from 28 mm to 70 mm and apertures from f/2.8 to f/22, producing fifty optical configurations together with calibration images for every configuration. The scenes include reflective surfaces, transparent glass, mirrors, fine detail, and mixed lighting so that geometric and optical effects can be isolated for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction, and novel view synthesis.
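
To make the stated arithmetic concrete, here is a minimal sketch of the capture grid. The abstract fixes only the endpoints (28-70 mm, f/2.8-f/22) and the counts (10 focal lengths, 5 apertures, 2000 images per scene); the intermediate stop values and the even per-configuration split below are illustrative assumptions, not the paper's actual settings:

```python
# Sketch of MODEST's stated capture grid, assuming evenly spaced stops.
# Endpoints and counts come from the abstract; intermediate values are
# illustrative placeholders, not the paper's actual settings.
import itertools

SCENES = range(1, 10)                                          # 9 scenes
FOCAL_LENGTHS_MM = [28, 33, 37, 42, 47, 51, 56, 61, 65, 70]    # 10 (assumed spacing)
APERTURES = ["f/2.8", "f/4", "f/5.6", "f/8", "f/22"]           # 5 (assumed stops)

configs = list(itertools.product(FOCAL_LENGTHS_MM, APERTURES))
assert len(configs) == 50          # 50 optical configurations per scene

# 2000 images per scene across 50 configurations and 2 cameras
# => 20 images per configuration per camera, if the per-scene budget is
#    split evenly (the paper does not state the split, and calibration
#    frames may be counted separately).
images_per_scene = 2000
per_config_per_camera = images_per_scene / (len(configs) * 2)
print(per_config_per_camera)               # 20.0
print(len(SCENES) * images_per_scene)      # 18000 total images
```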

What carries the argument

The central object is the dual identical DSLR capture protocol that records synchronized stereo pairs while stepping through ten focal lengths and five apertures for each scene, paired with per-configuration calibration images that enable separate analysis of geometric and optical influences.
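
As an illustration of what the per-configuration calibration sets support, here is a hedged sketch of the classical baseline using OpenCV's chessboard calibration. The directory layout, board geometry, and square size are hypothetical; MODEST ships its own calibration files, so this only shows the workflow such files enable:

```python
# Minimal per-configuration stereo calibration sketch with OpenCV.
# Paths, board dimensions, and square size are assumptions for
# illustration, not the dataset's documented calibration target.
import glob
import cv2
import numpy as np

BOARD = (9, 6)       # assumed inner-corner grid of the calibration target
SQUARE_MM = 25.0     # assumed square size

# 3D board coordinates, identical for every detected view.
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE_MM

def find_pairs(left_paths, right_paths):
    obj_pts, img_l, img_r, size = [], [], [], None
    for lp, rp in zip(left_paths, right_paths):
        gl = cv2.imread(lp, cv2.IMREAD_GRAYSCALE)
        gr = cv2.imread(rp, cv2.IMREAD_GRAYSCALE)
        ok_l, cl = cv2.findChessboardCorners(gl, BOARD)
        ok_r, cr = cv2.findChessboardCorners(gr, BOARD)
        if ok_l and ok_r:                  # keep only jointly detected views
            obj_pts.append(objp)
            img_l.append(cl)
            img_r.append(cr)
            size = gl.shape[::-1]
    return obj_pts, img_l, img_r, size

# One calibration set per optical configuration (hypothetical layout).
left = sorted(glob.glob("calib/28mm_f2.8/left/*.jpg"))
right = sorted(glob.glob("calib/28mm_f2.8/right/*.jpg"))
obj_pts, img_l, img_r, size = find_pairs(left, right)

# Per-camera intrinsics for this configuration...
_, K_l, d_l, _, _ = cv2.calibrateCamera(obj_pts, img_l, size, None, None)
_, K_r, d_r, _, _ = cv2.calibrateCamera(obj_pts, img_r, size, None, None)

# ...then the rig extrinsics (rotation R, translation T) between the bodies.
ret, _, _, _, _, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, img_l, img_r, K_l, d_l, K_r, d_r, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```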

If this is right

  • Controlled analysis of how focal length and aperture affect monocular and stereo depth estimation becomes feasible on real data (see the ablation sketch after this list).
  • Classical and learning-based intrinsic and extrinsic calibration methods can be evaluated across fifty optical configurations.
  • Current state-of-the-art monocular, stereo depth, and depth-of-field methods can be tested against documented real optical challenges.
  • Research on shallow depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis gains a real-world benchmark with calibration support.
  • The realism gap between synthetic training data and actual professional camera optics can be measured and reduced.
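
A minimal sketch of the kind of ablation these points describe, assuming a pluggable monocular depth model (`predict_depth` is a placeholder, not an API from the paper). Since MODEST provides no dense ground truth, the sketch measures prediction drift across optical configurations relative to a fixed reference configuration, which is roughly what the paper's Figure 5 visualizes:

```python
# Focal/aperture ablation sketch: run one monocular depth model over
# every optical configuration of a scene and measure drift against a
# reference configuration. Consistency stands in for accuracy because
# the dataset has no dense ground-truth depth.
import numpy as np

def predict_depth(image: np.ndarray) -> np.ndarray:
    raise NotImplementedError  # plug in any monocular depth model here

def scale_align(pred: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # Median scaling removes the global scale ambiguity of monocular depth.
    return pred * (np.median(ref) / np.median(pred))

def abs_rel(pred: np.ndarray, ref: np.ndarray) -> float:
    # Mean absolute relative difference; assumes positive depth values.
    return float(np.mean(np.abs(pred - ref) / ref))

def ablate(scene_images: dict) -> dict:
    """scene_images maps (focal_mm, aperture) -> image array."""
    ref_key = min(scene_images)       # fixed reference: first key in sort order
    ref = predict_depth(scene_images[ref_key])
    scores = {}
    for key, img in scene_images.items():
        pred = scale_align(predict_depth(img), ref)
        scores[key] = abs_rel(pred, ref)   # drift vs. reference configuration
    return scores
```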

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained on MODEST may generalize more reliably to professional camera inputs in robotics and augmented reality than models trained only on synthetic or fixed-optics data.
  • Explicit modeling of varying depth of field could become a standard component of depth pipelines rather than an afterthought.
  • Extensions that add temporal sequences or additional camera brands would test whether the current nine scenes already capture the essential optical variations.
  • Vision algorithms might shift from assuming fixed pinhole optics toward pipelines that ingest focal length and aperture metadata as first-class inputs.

Load-bearing premise

The nine chosen scenes and the two identical camera assemblies sufficiently represent the diversity and optical complexity of real professional camera use without unaccounted capture artifacts or selection biases.

What would settle it

If depth estimation models trained on synthetic data achieve accuracy on independent real DSLR captures that matches or exceeds accuracy on MODEST, or if the optical effects in the dataset prove reproducible by simple pinhole models without the recorded parameter changes, the claim of unique optical realism would be challenged.

Figures

Figures reproduced from arXiv: 2511.20853 by Li-Yun Wang, Nisarg K. Trivedi, Vinayak A. Belludi.

Figure 1. MODEST dataset. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies. [figure: figures/full_fig_p004_1.png]
Figure 2. Four SOTA monocular and four stereo depth estimation models are evaluated on 3 image pairs with optical illusions, and as evident, … [figure: figures/full_fig_p005_2.png]
Figure 3. 3D scene reconstruction results from two approaches: … [figure: figures/full_fig_p007_3.png]
Figure 4. Visualization of shallow depth-of-field (DoF) rendering across different scene angles. There are three lens blurring models used … [figure: figures/full_fig_p008_4.png]
Figure 5. Ablation study across focal lengths (top row) and apertures (bottom row) for four state-of-the-art monocular and four stereo depth … [figure: figures/full_fig_p010_5.png]
Figure 6. Visualization of three state-of-the-art deblurring methods on three images from Scene 1 of our dataset. [figure: figures/full_fig_p010_6.png]
Figure 7. Comparison of 3D scene reconstruction results among … [figure: figures/full_fig_p011_7.png]
Figure 8. Qualitative Splatt3R [35] results highlighting Gaussian … [figure: figures/full_fig_p011_8.png]
Original abstract

Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472×3648 px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MODEST dataset: the first high-resolution (5472×3648 px) stereo DSLR dataset with 18,000 images captured across 9 complex real scenes. It systematically varies 10 focal lengths (28-70 mm) and 5 apertures (f/2.8 to f/22) using two identical camera assemblies, providing 2000 images per scene along with dedicated calibration sets for each configuration. The dataset aims to support research on monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis under realistic optical conditions, including challenging elements like reflections, transparencies, and optical illusions.

Significance. If the captured data accurately represents professional camera optics without unaccounted artifacts, this dataset would fill an important gap in real-world high-fidelity stereo data for computer vision. It enables controlled studies of geometric and optical effects across a wide range of focal and aperture settings, which is currently limited by synthetic data or smaller real datasets. The public release of images, calibrations, and evaluation code promotes reproducible research on optical generalization.

major comments (1)
  1. [Abstract] The headline claim of capturing 'the optical realism and complexity of professional camera systems' across 'complex real scenes' rests on a selection of only 9 scenes described qualitatively by complexity, lighting, and background. No quantitative metrics of scene diversity (such as depth histograms, material coverage, or lighting variation statistics) or comparisons to standard scene benchmarks are provided, which is load-bearing for asserting that the dataset bridges the realism gap for broad professional use.
minor comments (2)
  1. [Abstract] The resolution is specified as 5472×3648px, but it would improve clarity to explicitly state whether this applies to each image in the stereo pair or if there is any downsampling involved in the release.
  2. [Abstract] The abstract states that the work 'demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods,' but does not reference specific quantitative results, tables, or figures where these demonstrations are shown.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the MODEST dataset to address gaps in real-world optical data. We address the single major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of capturing 'the optical realism and complexity of professional camera systems' across 'complex real scenes' rests on a selection of only 9 scenes described qualitatively by complexity, lighting, and background. No quantitative metrics of scene diversity (such as depth histograms, material coverage, or lighting variation statistics) or comparisons to standard scene benchmarks are provided, which is load-bearing for asserting that the dataset bridges the realism gap for broad professional use.

    Authors: We agree that the current manuscript describes the nine scenes primarily through qualitative attributes (varying complexity, lighting, and background) and specific challenging elements such as reflections, transparencies, mirrors, and optical illusions. Quantitative metrics of scene diversity are indeed absent, which limits the strength of claims about broad realism and generalization. In the revised manuscript we will add a dedicated subsection on scene characterization that includes: (1) categorical statistics (e.g., number of scenes containing reflective surfaces, transparent elements, fine-grained textures, and multi-scale illusions), (2) basic lighting variation measures derived from image histograms and exposure metadata, and (3) a comparison table contrasting key scene properties against common benchmarks such as KITTI, Middlebury, and NYU Depth V2. Because the dataset consists of real-world captures without dense ground-truth depth, we will not fabricate depth histograms; instead we will report statistics on estimated depth ranges obtained from the stereo pairs using a standard baseline method, clearly labeled as such. These additions will be placed in Section 3 and referenced in the abstract. Revision: yes.
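
For context on item (2), here is a hedged sketch of per-image lighting statistics of the kind the rebuttal proposes: luminance histogram measures plus exposure metadata read from EXIF. The file layout is hypothetical and tag availability varies by camera:

```python
# Sketch of lighting-variation measures: per-image luminance statistics
# and exposure metadata. Layout and tags are assumptions for illustration.
import glob
import numpy as np
from PIL import Image
from PIL.ExifTags import TAGS

def luminance_stats(path: str) -> dict:
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    hist, _ = np.histogram(img, bins=256, range=(0, 255), density=True)
    nz = hist[hist > 0]
    entropy = -np.sum(nz * np.log2(nz))   # spread of the luminance histogram
    return {"mean": img.mean(), "std": img.std(), "entropy": entropy}

def exposure_metadata(path: str) -> dict:
    exif = Image.open(path).getexif()
    sub = exif.get_ifd(0x8769)            # Exif sub-IFD, where exposure tags live
    named = {TAGS.get(k, k): v for k, v in sub.items()}
    return {k: named.get(k) for k in ("ExposureTime", "FNumber", "ISOSpeedRatings")}

for p in sorted(glob.glob("scene1/*.jpg")):   # hypothetical layout
    print(p, luminance_stats(p), exposure_metadata(p))
```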

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or predictions

full rationale

The paper is a data-release contribution describing capture of 18000 high-resolution stereo images across 9 scenes with systematic variation in focal length and aperture. No equations, models, predictions, or fitted parameters appear in the provided text or abstract. The central claims rest on the empirical description of the capture protocol and scene selection rather than any derivation chain that could reduce to self-definition or self-citation. This is the expected non-finding for a dataset paper whose value is independent of internal mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard practices in camera calibration and scene capture rather than new theoretical constructs. No free parameters are fitted to data in the claim, and no new entities are postulated.

axioms (1)
  • standard math: Standard camera calibration models for intrinsics and extrinsics apply to the dedicated calibration image sets
    Invoked when stating that each focal configuration has a dedicated calibration image set supporting evaluation of classical and learning-based methods.

pith-pipeline@v0.9.0 · 5627 in / 1329 out tokens · 41553 ms · 2026-05-17T04:09:46.562195+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

  1. [1]

    Dpdd: A deep photographic defocus dataset

    Abdullah Abuolaim, Abhijith Punnappurath, and Michael S Brown. Dpdd: A deep photographic defocus dataset. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 611–619. IEEE, 2019.

  2. [2]

    Depth Pro: Sharp monocular metric depth in less than a second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second, 2024.

  3. [3]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.

  4. [4]

    OpenMVS: Multi-view stereo reconstruction library

    Dan Cernea. OpenMVS: Multi-view stereo reconstruction library. 2020.

  5. [5]

    Matterport3D: Learning from RGB-D data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017.

  6. [6]

    MonSter: Marry monodepth to stereo unleashes power

    Junda Cheng, Xueqin Wang, Wei Wang, Lei Zhu, Jian Liu, Xinyu Li, and others. MonSter: Marry monodepth to stereo unleashes power. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10186–10196, 2025.

  7. [7]

    Polarimetric multi-view stereo

    Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, and Jan Kautz. Polarimetric multi-view stereo. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 369–378, 2017.

  8. [8]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

  9. [9]

    Virtual KITTI: A synthetic dataset for evaluating stereo and optical flow

    Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual KITTI: A synthetic dataset for evaluating stereo and optical flow. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–10. IEEE, 2016.

  10. [10]

    DDAD: A real-world dataset for unsupervised deep-learning-based depth and ego-motion estimation

    Suman Garg, Qiao Wang, Siyuan Chen, Yanan Liu, Yuxuan Li, Yujie Wang, Wenqiang Zhang, Raquel Urtasun, and Yukun Li. DDAD: A real-world dataset for unsupervised deep-learning-based depth and ego-motion estimation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7089–7095. IEEE, 2020.

  11. [11]

    Are we ready for autonomous driving? The KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.

  12. [12]

    Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024.

  13. [13]

    Vabd: A video aberration and blur dataset

    Thomas Huang, Fu-Jen Tung, Yirui Sun, and Michael S Brown. Vabd: A video aberration and blur dataset. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25016–25026. IEEE, 2024.

  14. [14]

    Rendering natural camera bokeh effect with deep learning

    Andrey Ignatov, Jagruti Patel, and Radu Timofte. Rendering natural camera bokeh effect with deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 418–419, 2020.

  15. [15]

    AIM 2020 challenge on rendering realistic bokeh

    Andrey Ignatov, Radu Timofte, Ming Qian, Congyu Qiao, Jiamin Lin, Zhenyu Guo, Chenghua Li, Cong Leng, Jian Cheng, Juewen Peng, et al. AIM 2020 challenge on rendering realistic bokeh. In European Conference on Computer Vision, pages 213–228. Springer, 2020.

  16. [16]

    Secret lies in color: Enhancing AI-generated images detection with color distribution analysis

    Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Xiaoyue Duan, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, and Jie Zhou. Secret lies in color: Enhancing AI-generated images detection with color distribution analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13445–13454, 2025.

  17. [17]

    DEFOM-Stereo: Depth foundation model based stereo matching

    Hualie Jiang, Zexian Lou, Li Ding, Rui Xu, Mingtan Tan, Wei Jiang, and Rong Huang. DEFOM-Stereo: Depth foundation model based stereo matching. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  18. [18]

    MapAnything: Universal feed-forward metric 3D reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.

  19. [19]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering, 2023.

  20. [20]

    Evaluation of CNN-based single-image depth estimation methods

    Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of CNN-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

  21. [21]

    ibims-1: A dataset for rigid multi-view stereo

    Tobias Koch, Christian Hane, Johannes Jordan, and Friedrich Fraundorfer. ibims-1: A dataset for rigid multi-view stereo. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7065–7071. IEEE, 2020.

  22. [22]

    Efficient frequency domain-based transformers for high-quality image deblurring

    Lingshun Kong, Jiangxin Dong, Jianjun Ge, Mingqiang Li, and Jinshan Pan. Efficient frequency domain-based transformers for high-quality image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5886–5895, 2023.

  23. [23]

    Efficient visual state space model for image deblurring

    Lingshun Kong, Jiangxin Dong, Jinhui Tang, Ming-Hsuan Yang, and Jinshan Pan. Efficient visual state space model for image deblurring. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12710–12719, 2025.

  24. [24]

    Hammer: A large-scale, hand-object, multi-view, temporally-and-spatially-annotated dataset

    Hamid Laga, Sutanu Jati, Ilaria Falco, Simone Melzi, Marco Manzo, Freek Stulp, Umberto Castellani, Antti Oulasvirta, and Chi Ren. Hammer: A large-scale, hand-object, multi-view, temporally-and-spatially-annotated dataset. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21008–21017. IEEE, 2022.

  25. [25]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis, 2020.

  26. [26]

    Deep multi-scale convolutional neural network for dynamic scene deblurring

    Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2671–2680, 2017.

  27. [27]

    BokehMe: When neural rendering meets classical rendering

    Juewen Peng, Zhiguo Cao, Xianrui Luo, Hao Lu, Ke Xian, and Jianming Zhang. BokehMe: When neural rendering meets classical rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  28. [28]

    BokehMe: When neural rendering meets classical rendering

    Juewen Peng, Zhiguo Cao, Xianrui Luo, Hao Lu, Ke Xian, and Jianming Zhang. BokehMe: When neural rendering meets classical rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16283–16292, 2022.

  29. [29]

    UniDepthV2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.

  30. [30]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  31. [31]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

  32. [32]

    ETH3D: A benchmark for multi-view stereo

    Thomas Schöps, Johannes L Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. ETH3D: A benchmark for multi-view stereo. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3954–3963. IEEE, 2017.

  33. [33]

    Dr. Bokeh: Differentiable occlusion-aware bokeh rendering

    Yichen Sheng, Zixun Yu, Lu Ling, Zhiwen Cao, Xuaner Zhang, Xin Lu, Ke Xian, Haiting Lin, and Bedrich Benes. Dr. Bokeh: Differentiable occlusion-aware bokeh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4515–4525, 2024.

  34. [34]

    Indoor segmentation and support inference from RGBD images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.

  35. [35]

    Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024.

  36. [36]

    The Replica dataset: A digital replica of indoor spaces

    Julian Straub, Manel Galindo, Dhruv Jayaraman, Sudeep Ramakrishnan, Daniel Gordon, Richard Newcombe, Georgia Gkioxari, and Jitendra Malik. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.

  37. [37]

    A benchmark for the evaluation of RGB-D SLAM systems

    Jürgen Sturm, Jakob Engel, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012.

  38. [38]

    DIODE: A dense indoor and outdoor depth dataset

    Igor Vlasic, Maria Shugrina, Or Litany, Angela Dai, and Matthias Nießner. DIODE: A dense indoor and outdoor depth dataset. In 2019 International Conference on 3D Vision (3DV), pages 310–320. IEEE, 2019.

  39. [39]

    Void: A new dataset and a baseline for void region filling

    Lei Wang, Jian-Fang Zhang, Yebin Wang, Kun Yu, Yizhou Liu, and Tian Wu. Void: A new dataset and a baseline for void region filling. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 3155–3169. IEEE, 2020.

  40. [40]

    Selective-Stereo: Adaptive frequency information selection for stereo matching

    Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-Stereo: Adaptive frequency information selection for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19701–19710, 2024.

  41. [42]

    FoundationStereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. FoundationStereo: Zero-shot stereo matching. CVPR, 2025.

  42. [43]

    Unsupervised monocular depth learning in dynamic scenes

    Alex Wong, Wei-Chih Chiu, and Stefano Soatto. Unsupervised monocular depth learning in dynamic scenes. In Conference on Robot Learning, pages 1016–1031. PMLR, 2020.

  43. [44]

    nLMVS-Net: Deep non-Lambertian multi-view stereo

    Kohei Yamashita, Yuto Enyo, Shohei Nobuhara, and Ko Nishino. nLMVS-Net: Deep non-Lambertian multi-view stereo. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.

  44. [45]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2, 2024.

  45. [46]

    A practical 3D reconstruction method for weak texture scenes

    Xuyuan Yang and Guang Jiang. A practical 3D reconstruction method for weak texture scenes. Remote Sensing, 13(16), 2021.

  46. [47]

    3D visual illusion depth estimation

    Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, and Yunde Jia. 3D visual illusion depth estimation. arXiv preprint arXiv:2505.13061, 2025.

  47. [48]

    ScanNet++: A high-fidelity dataset of 3D indoor scenes

    Chandan Yeshwanth, Shubham Tulsiani, Ishan Nerurkar, Georgia Gkioxari, Jitendra Malik, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20997–21007. IEEE, 2024.

  48. [49]

    Restormer: Efficient transformer for high-resolution image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.

  49. [50]

    Blur-aware lens blur synthesis

    Jhih-Ciang Zheng, Fu-Jen Tung, and Michael S Brown. Blur-aware lens blur synthesis. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17845–17854. IEEE, 2022.

  50. [51]

    BokehDiff: Neural lens blur with one-step diffusion

    Chengxuan Zhu, Qingnan Fan, Qi Zhang, Jinwei Chen, Huaqi Zhang, Chao Xu, and Boxin Shi. BokehDiff: Neural lens blur with one-step diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9508–9518, 2025.