pith. machine review for the scientific record.

arxiv: 2511.20853 · v3 · submitted 2025-11-25 · 💻 cs.CV · cs.AI · cs.LG · eess.IV

Recognition: 2 theorem links · Lean Theorem

MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 04:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · eess.IV
keywords stereo dataset · depth estimation · depth of field · DSLR · optical effects · camera calibration · real-world data · computer vision

The pith

The authors introduce the first high-resolution stereo DSLR dataset with 18000 images that systematically varies focal length and aperture across real scenes to capture professional camera optics for depth tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that depth estimation and related vision tasks suffer from poor generalization because existing datasets lack the optical complexity of real professional cameras. It addresses this by releasing MODEST, a collection of 18000 high-resolution stereo images captured with two identical DSLR assemblies. The images cover nine scenes at ten focal lengths and five apertures, yielding fifty distinct optical configurations per scene along with dedicated calibration sets. A sympathetic reader would care because the dataset supplies controlled real data for studying how focus and aperture changes affect monocular depth, stereo matching, deblurring, and 3D reconstruction. If the claim holds, researchers gain a testbed that more closely matches actual camera behavior than synthetic alternatives.

Core claim

The central claim is that MODEST supplies the first large-scale, high-resolution (5472 by 3648 pixels) stereo DSLR dataset in which focal length and aperture are varied systematically across complex real scenes. For each of nine scenes the authors record 2000 images using two matched camera rigs at focal lengths from 28 mm to 70 mm and apertures from f/2.8 to f/22, producing fifty optical configurations together with calibration images for every configuration. The scenes include reflective surfaces, transparent glass, mirrors, fine detail, and mixed lighting so that geometric and optical effects can be isolated for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction, and novel view synthesis.
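
To make the stated arithmetic concrete, here is a minimal sketch of the capture grid. The abstract fixes only the endpoints (28-70 mm, f/2.8-f/22) and the counts (10 focal lengths, 5 apertures, 2000 images per scene); the intermediate stop values and the even per-configuration split below are illustrative assumptions, not the paper's actual settings:

```python
# Sketch of MODEST's stated capture grid, assuming evenly spaced stops.
# Endpoints and counts come from the abstract; intermediate values are
# illustrative placeholders, not the paper's actual settings.
import itertools

SCENES = range(1, 10)                                          # 9 scenes
FOCAL_LENGTHS_MM = [28, 33, 37, 42, 47, 51, 56, 61, 65, 70]    # 10 (assumed spacing)
APERTURES = ["f/2.8", "f/4", "f/5.6", "f/8", "f/22"]           # 5 (assumed stops)

configs = list(itertools.product(FOCAL_LENGTHS_MM, APERTURES))
assert len(configs) == 50          # 50 optical configurations per scene

# 2000 images per scene across 50 configurations and 2 cameras
# => 20 images per configuration per camera, if the per-scene budget is
#    split evenly (the paper does not state the split, and calibration
#    frames may be counted separately).
images_per_scene = 2000
per_config_per_camera = images_per_scene / (len(configs) * 2)
print(per_config_per_camera)               # 20.0
print(len(SCENES) * images_per_scene)      # 18000 total images
```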

What carries the argument

The central object is the dual identical DSLR capture protocol that records synchronized stereo pairs while stepping through ten focal lengths and five apertures for each scene, paired with per-configuration calibration images that enable separate analysis of geometric and optical influences.
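
As an illustration of what the per-configuration calibration sets support, here is a hedged sketch of the classical baseline using OpenCV's chessboard calibration. The directory layout, board geometry, and square size are hypothetical; MODEST ships its own calibration files, so this only shows the workflow such files enable:

```python
# Minimal per-configuration stereo calibration sketch with OpenCV.
# Paths, board dimensions, and square size are assumptions for
# illustration, not the dataset's documented calibration target.
import glob
import cv2
import numpy as np

BOARD = (9, 6)       # assumed inner-corner grid of the calibration target
SQUARE_MM = 25.0     # assumed square size

# 3D board coordinates, identical for every detected view.
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE_MM

def find_pairs(left_paths, right_paths):
    obj_pts, img_l, img_r, size = [], [], [], None
    for lp, rp in zip(left_paths, right_paths):
        gl = cv2.imread(lp, cv2.IMREAD_GRAYSCALE)
        gr = cv2.imread(rp, cv2.IMREAD_GRAYSCALE)
        ok_l, cl = cv2.findChessboardCorners(gl, BOARD)
        ok_r, cr = cv2.findChessboardCorners(gr, BOARD)
        if ok_l and ok_r:                  # keep only jointly detected views
            obj_pts.append(objp)
            img_l.append(cl)
            img_r.append(cr)
            size = gl.shape[::-1]
    return obj_pts, img_l, img_r, size

# One calibration set per optical configuration (hypothetical layout).
left = sorted(glob.glob("calib/28mm_f2.8/left/*.jpg"))
right = sorted(glob.glob("calib/28mm_f2.8/right/*.jpg"))
obj_pts, img_l, img_r, size = find_pairs(left, right)

# Per-camera intrinsics for this configuration...
_, K_l, d_l, _, _ = cv2.calibrateCamera(obj_pts, img_l, size, None, None)
_, K_r, d_r, _, _ = cv2.calibrateCamera(obj_pts, img_r, size, None, None)

# ...then the rig extrinsics (rotation R, translation T) between the bodies.
ret, _, _, _, _, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, img_l, img_r, K_l, d_l, K_r, d_r, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```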

If this is right

  • Controlled analysis of how focal length and aperture affect monocular and stereo depth estimation becomes feasible on real data (see the ablation sketch after this list).
  • Classical and learning-based intrinsic and extrinsic calibration methods can be evaluated across fifty optical configurations.
  • Current state-of-the-art monocular, stereo depth, and depth-of-field methods can be tested against documented real optical challenges.
  • Research on shallow depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis gains a real-world benchmark with calibration support.
  • The realism gap between synthetic training data and actual professional camera optics can be measured and reduced.
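
A minimal sketch of the kind of ablation these points describe, assuming a pluggable monocular depth model (`predict_depth` is a placeholder, not an API from the paper). Since MODEST provides no dense ground truth, the sketch measures prediction drift across optical configurations relative to a fixed reference configuration, which is roughly what the paper's Figure 5 visualizes:

```python
# Focal/aperture ablation sketch: run one monocular depth model over
# every optical configuration of a scene and measure drift against a
# reference configuration. Consistency stands in for accuracy because
# the dataset has no dense ground-truth depth.
import numpy as np

def predict_depth(image: np.ndarray) -> np.ndarray:
    raise NotImplementedError  # plug in any monocular depth model here

def scale_align(pred: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # Median scaling removes the global scale ambiguity of monocular depth.
    return pred * (np.median(ref) / np.median(pred))

def abs_rel(pred: np.ndarray, ref: np.ndarray) -> float:
    # Mean absolute relative difference; assumes positive depth values.
    return float(np.mean(np.abs(pred - ref) / ref))

def ablate(scene_images: dict) -> dict:
    """scene_images maps (focal_mm, aperture) -> image array."""
    ref_key = min(scene_images)       # fixed reference: first key in sort order
    ref = predict_depth(scene_images[ref_key])
    scores = {}
    for key, img in scene_images.items():
        pred = scale_align(predict_depth(img), ref)
        scores[key] = abs_rel(pred, ref)   # drift vs. reference configuration
    return scores
```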

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained on MODEST may generalize more reliably to professional camera inputs in robotics and augmented reality than models trained only on synthetic or fixed-optics data.
  • Explicit modeling of varying depth of field could become a standard component of depth pipelines rather than an afterthought.
  • Extensions that add temporal sequences or additional camera brands would test whether the current nine scenes already capture the essential optical variations.
  • Vision algorithms might shift from assuming fixed pinhole optics toward pipelines that ingest focal length and aperture metadata as first-class inputs.

Load-bearing premise

The nine chosen scenes and the two identical camera assemblies sufficiently represent the diversity and optical complexity of real professional camera use without unaccounted capture artifacts or selection biases.

What would settle it

If depth estimation models trained on synthetic data achieve accuracy on independent real DSLR captures that matches or exceeds accuracy on MODEST, or if the optical effects in the dataset prove reproducible by simple pinhole models without the recorded parameter changes, the claim of unique optical realism would be challenged.

Figures

Figures reproduced from arXiv: 2511.20853 by Li-Yun Wang, Nisarg K. Trivedi, Vinayak A. Belludi.

Figure 1. MODEST dataset. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies. [figure: figures/full_fig_p004_1.png]
Figure 2. Four SOTA monocular and four stereo depth estimation models are evaluated on 3 image pairs with optical illusions, and as evident, … [figure: figures/full_fig_p005_2.png]
Figure 3. 3D scene reconstruction results from two approaches: … [figure: figures/full_fig_p007_3.png]
Figure 4. Visualization of shallow depth-of-field (DoF) rendering across different scene angles. There are three lens blurring models used … [figure: figures/full_fig_p008_4.png]
Figure 5. Ablation study across focal lengths (top row) and apertures (bottom row) for four state-of-the-art monocular and four stereo depth … [figure: figures/full_fig_p010_5.png]
Figure 6. Visualization of three state-of-the-art deblurring methods on three images from Scene 1 of our dataset. [figure: figures/full_fig_p010_6.png]
Figure 7. Comparison of 3D scene reconstruction results among … [figure: figures/full_fig_p011_7.png]
Figure 8. Qualitative Splatt3R [35] results highlighting Gaussian … [figure: figures/full_fig_p011_8.png]
Original abstract

Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472×3648 px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MODEST dataset: the first high-resolution (5472×3648 px) stereo DSLR dataset with 18,000 images captured across 9 complex real scenes. It systematically varies 10 focal lengths (28-70 mm) and 5 apertures (f/2.8 to f/22) using two identical camera assemblies, providing 2000 images per scene along with dedicated calibration sets for each configuration. The dataset aims to support research on monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis under realistic optical conditions, including challenging elements like reflections, transparencies, and optical illusions.

Significance. If the captured data accurately represents professional camera optics without unaccounted artifacts, this dataset would fill an important gap in real-world high-fidelity stereo data for computer vision. It enables controlled studies of geometric and optical effects across a wide range of focal and aperture settings, which is currently limited by synthetic data or smaller real datasets. The public release of images, calibrations, and evaluation code promotes reproducible research on optical generalization.

major comments (1)
  1. [Abstract] The headline claim of capturing 'the optical realism and complexity of professional camera systems' across 'complex real scenes' rests on a selection of only 9 scenes described qualitatively by complexity, lighting, and background. No quantitative metrics of scene diversity (such as depth histograms, material coverage, or lighting variation statistics) or comparisons to standard scene benchmarks are provided, which is load-bearing for asserting that the dataset bridges the realism gap for broad professional use.
minor comments (2)
  1. [Abstract] The resolution is specified as 5472×3648px, but it would improve clarity to explicitly state whether this applies to each image in the stereo pair or if there is any downsampling involved in the release.
  2. [Abstract] The abstract states that the work 'demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods,' but does not reference specific quantitative results, tables, or figures where these demonstrations are shown.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the MODEST dataset to address gaps in real-world optical data. We address the single major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of capturing 'the optical realism and complexity of professional camera systems' across 'complex real scenes' rests on a selection of only 9 scenes described qualitatively by complexity, lighting, and background. No quantitative metrics of scene diversity (such as depth histograms, material coverage, or lighting variation statistics) or comparisons to standard scene benchmarks are provided, which is load-bearing for asserting that the dataset bridges the realism gap for broad professional use.

    Authors: We agree that the current manuscript describes the nine scenes primarily through qualitative attributes (varying complexity, lighting, and background) and specific challenging elements such as reflections, transparencies, mirrors, and optical illusions. Quantitative metrics of scene diversity are indeed absent, which limits the strength of claims about broad realism and generalization. In the revised manuscript we will add a dedicated subsection on scene characterization that includes: (1) categorical statistics (e.g., number of scenes containing reflective surfaces, transparent elements, fine-grained textures, and multi-scale illusions), (2) basic lighting variation measures derived from image histograms and exposure metadata, and (3) a comparison table contrasting key scene properties against common benchmarks such as KITTI, Middlebury, and NYU Depth V2. Because the dataset consists of real-world captures without dense ground-truth depth, we will not fabricate depth histograms; instead we will report statistics on estimated depth ranges obtained from the stereo pairs using a standard baseline method, clearly labeled as such. These additions will be placed in Section 3 and referenced in the abstract. Revision: yes.
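
For context on item (2), here is a hedged sketch of per-image lighting statistics of the kind the rebuttal proposes: luminance histogram measures plus exposure metadata read from EXIF. The file layout is hypothetical and tag availability varies by camera:

```python
# Sketch of lighting-variation measures: per-image luminance statistics
# and exposure metadata. Layout and tags are assumptions for illustration.
import glob
import numpy as np
from PIL import Image
from PIL.ExifTags import TAGS

def luminance_stats(path: str) -> dict:
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    hist, _ = np.histogram(img, bins=256, range=(0, 255), density=True)
    nz = hist[hist > 0]
    entropy = -np.sum(nz * np.log2(nz))   # spread of the luminance histogram
    return {"mean": img.mean(), "std": img.std(), "entropy": entropy}

def exposure_metadata(path: str) -> dict:
    exif = Image.open(path).getexif()
    sub = exif.get_ifd(0x8769)            # Exif sub-IFD, where exposure tags live
    named = {TAGS.get(k, k): v for k, v in sub.items()}
    return {k: named.get(k) for k in ("ExposureTime", "FNumber", "ISOSpeedRatings")}

for p in sorted(glob.glob("scene1/*.jpg")):   # hypothetical layout
    print(p, luminance_stats(p), exposure_metadata(p))
```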

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or predictions

full rationale

The paper is a data-release contribution describing capture of 18000 high-resolution stereo images across 9 scenes with systematic variation in focal length and aperture. No equations, models, predictions, or fitted parameters appear in the provided text or abstract. The central claims rest on the empirical description of the capture protocol and scene selection rather than any derivation chain that could reduce to self-definition or self-citation. This is the expected non-finding for a dataset paper whose value is independent of internal mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard practices in camera calibration and scene capture rather than new theoretical constructs. No free parameters are fitted to data in the claim, and no new entities are postulated.

axioms (1)
  • standard math: Standard camera calibration models for intrinsics and extrinsics apply to the dedicated calibration image sets
    Invoked when stating that each focal configuration has a dedicated calibration image set supporting evaluation of classical and learning-based methods.

pith-pipeline@v0.9.0 · 5627 in / 1329 out tokens · 41553 ms · 2026-05-17T04:09:46.562195+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

  1. [1]

    Dpdd: A deep photographic defocus dataset

    Abdullah Abuolaim, Abhijith Punnappurath, and Michael S Brown. Dpdd: A deep photographic defocus dataset. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 611–619. IEEE, 2019.

  2. [2]

    Depth Pro: Sharp monocular metric depth in less than a second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second, 2024.

  3. [3]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.

  4. [4]

    OpenMVS: Multi-view stereo reconstruction library

    Dan Cernea. OpenMVS: Multi-view stereo reconstruction library. 2020.

  5. [5]

    Matterport3D: Learning from RGB-D data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017.

  6. [6]

    MonSter: Marry monodepth to stereo unleashes power

    Junda Cheng, Xueqin Wang, Wei Wang, Lei Zhu, Jian Liu, Xinyu Li, and others. MonSter: Marry monodepth to stereo unleashes power. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10186–10196, 2025.

  7. [7]

    Polarimetric multi-view stereo

    Zhaopeng Cui, Jinwei Gu, Boxin Shi, Ping Tan, and Jan Kautz. Polarimetric multi-view stereo. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 369–378, 2017.

  8. [8]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

  9. [9]

    Virtual KITTI: A synthetic dataset for evaluating stereo and optical flow

    Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual KITTI: A synthetic dataset for evaluating stereo and optical flow. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–10. IEEE, 2016.

  10. [10]

    DDAD: A real-world dataset for unsupervised deep-learning-based depth and ego-motion estimation

    Suman Garg, Qiao Wang, Siyuan Chen, Yanan Liu, Yuxuan Li, Yujie Wang, Wenqiang Zhang, Raquel Urtasun, and Yukun Li. DDAD: A real-world dataset for unsupervised deep-learning-based depth and ego-motion estimation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7089–7095. IEEE, 2020.

  11. [11]

    Are we ready for autonomous driving? The KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.

  12. [12]

    Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024.

  13. [13]

    Vabd: A video aberration and blur dataset

    Thomas Huang, Fu-Jen Tung, Yirui Sun, and Michael S Brown. Vabd: A video aberration and blur dataset. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25016–25026. IEEE, 2024.

  14. [14]

    Rendering natural camera bokeh effect with deep learning

    Andrey Ignatov, Jagruti Patel, and Radu Timofte. Rendering natural camera bokeh effect with deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 418–419, 2020.

  15. [15]

    AIM 2020 challenge on rendering realistic bokeh

    Andrey Ignatov, Radu Timofte, Ming Qian, Congyu Qiao, Jiamin Lin, Zhenyu Guo, Chenghua Li, Cong Leng, Jian Cheng, Juewen Peng, et al. AIM 2020 challenge on rendering realistic bokeh. In European Conference on Computer Vision, pages 213–228. Springer, 2020.

  16. [16]

    Secret lies in color: Enhancing AI-generated images detection with color distribution analysis

    Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Xiaoyue Duan, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, and Jie Zhou. Secret lies in color: Enhancing AI-generated images detection with color distribution analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13445–13454, 2025.

  17. [17]

    DEFOM-Stereo: Depth foundation model based stereo matching

    Hualie Jiang, Zexian Lou, Li Ding, Rui Xu, Mingtan Tan, Wei Jiang, and Rong Huang. DEFOM-Stereo: Depth foundation model based stereo matching. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  18. [18]

    MapAnything: Universal feed-forward metric 3D reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.

  19. [19]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering, 2023.

  20. [20]

    Evaluation of CNN-based single-image depth estimation methods

    Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of CNN-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

  21. [21]

    ibims-1: A dataset for rigid multi-view stereo

    Tobias Koch, Christian Hane, Johannes Jordan, and Friedrich Fraundorfer. ibims-1: A dataset for rigid multi-view stereo. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7065–7071. IEEE, 2020.

  22. [22]

    Efficient frequency domain-based transformers for high-quality image deblurring

    Lingshun Kong, Jiangxin Dong, Jianjun Ge, Mingqiang Li, and Jinshan Pan. Efficient frequency domain-based transformers for high-quality image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5886–5895, 2023.

  23. [23]

    Efficient visual state space model for image deblurring

    Lingshun Kong, Jiangxin Dong, Jinhui Tang, Ming-Hsuan Yang, and Jinshan Pan. Efficient visual state space model for image deblurring. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12710–12719, 2025.

  24. [24]

    Hammer: A large-scale, hand-object, multi-view, temporally-and-spatially-annotated dataset

    Hamid Laga, Sutanu Jati, Ilaria Falco, Simone Melzi, Marco Manzo, Freek Stulp, Umberto Castellani, Antti Oulasvirta, and Chi Ren. Hammer: A large-scale, hand-object, multi-view, temporally-and-spatially-annotated dataset. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21008–21017. IEEE, 2022.

  25. [25]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis, 2020.

  26. [26]

    Deep multi-scale convolutional neural network for dynamic scene deblurring

    Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2671–2680, 2017.

  27. [27]

    BokehMe: When neural rendering meets classical rendering

    Juewen Peng, Zhiguo Cao, Xianrui Luo, Hao Lu, Ke Xian, and Jianming Zhang. BokehMe: When neural rendering meets classical rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  28. [28]

    BokehMe: When neural rendering meets classical rendering

    Juewen Peng, Zhiguo Cao, Xianrui Luo, Hao Lu, Ke Xian, and Jianming Zhang. BokehMe: When neural rendering meets classical rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16283–16292, 2022.

  29. [29]

    UniDepthV2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.

  30. [30]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  31. [31]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

  32. [32]

    ETH3D: A benchmark for multi-view stereo

    Thomas Schöps, Johannes L Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. ETH3D: A benchmark for multi-view stereo. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3954–3963. IEEE, 2017.

  33. [33]

    Dr. Bokeh: Differentiable occlusion-aware bokeh rendering

    Yichen Sheng, Zixun Yu, Lu Ling, Zhiwen Cao, Xuaner Zhang, Xin Lu, Ke Xian, Haiting Lin, and Bedrich Benes. Dr. Bokeh: Differentiable occlusion-aware bokeh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4515–4525, 2024.

  34. [34]

    Indoor segmentation and support inference from RGBD images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.

  35. [35]

    Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024.

  36. [36]

    The Replica dataset: A digital replica of indoor spaces

    Julian Straub, Manel Galindo, Dhruv Jayaraman, Sudeep Ramakrishnan, Daniel Gordon, Richard Newcombe, Georgia Gkioxari, and Jitendra Malik. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.

  37. [37]

    A benchmark for the evaluation of RGB-D SLAM systems

    Jürgen Sturm, Jakob Engel, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012.

  38. [38]

    DIODE: A dense indoor and outdoor depth dataset

    Igor Vlasic, Maria Shugrina, Or Litany, Angela Dai, and Matthias Nießner. DIODE: A dense indoor and outdoor depth dataset. In 2019 International Conference on 3D Vision (3DV), pages 310–320. IEEE, 2019.

  39. [39]

    Void: A new dataset and a baseline for void region filling

    Lei Wang, Jian-Fang Zhang, Yebin Wang, Kun Yu, Yizhou Liu, and Tian Wu. Void: A new dataset and a baseline for void region filling. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 3155–3169. IEEE, 2020.

  40. [40]

    Selective-Stereo: Adaptive frequency information selection for stereo matching

    Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-Stereo: Adaptive frequency information selection for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19701–19710, 2024.

  41. [42]

    FoundationStereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. FoundationStereo: Zero-shot stereo matching. CVPR, 2025.

  42. [43]

    Unsupervised monocular depth learning in dynamic scenes

    Alex Wong, Wei-Chih Chiu, and Stefano Soatto. Unsupervised monocular depth learning in dynamic scenes. In Conference on Robot Learning, pages 1016–1031. PMLR, 2020.

  43. [44]

    nLMVS-Net: Deep non-Lambertian multi-view stereo

    Kohei Yamashita, Yuto Enyo, Shohei Nobuhara, and Ko Nishino. nLMVS-Net: Deep non-Lambertian multi-view stereo. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.

  44. [45]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2, 2024.

  45. [46]

    A practical 3D reconstruction method for weak texture scenes

    Xuyuan Yang and Guang Jiang. A practical 3D reconstruction method for weak texture scenes. Remote Sensing, 13(16), 2021.

  46. [47]

    3D visual illusion depth estimation

    Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, and Yunde Jia. 3D visual illusion depth estimation. arXiv preprint arXiv:2505.13061, 2025.

  47. [48]

    ScanNet++: A high-fidelity dataset of 3D indoor scenes

    Chandan Yeshwanth, Shubham Tulsiani, Ishan Nerurkar, Georgia Gkioxari, Jitendra Malik, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20997–21007. IEEE, 2024.

  48. [49]

    Restormer: Efficient transformer for high-resolution image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.

  49. [50]

    Blur-aware lens blur synthesis

    Jhih-Ciang Zheng, Fu-Jen Tung, and Michael S Brown. Blur-aware lens blur synthesis. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17845–17854. IEEE, 2022.

  50. [51]

    BokehDiff: Neural lens blur with one-step diffusion

    Chengxuan Zhu, Qingnan Fan, Qi Zhang, Jinwei Chen, Huaqi Zhang, Chao Xu, and Boxin Shi. BokehDiff: Neural lens blur with one-step diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9508–9518, 2025.