Depth Anything V2
Pith reviewed 2026-05-13 14:51 UTC · model grok-4.3
The pith
Depth Anything V2 produces finer and more robust monocular depth predictions than V1 by training exclusively on synthetic images and pseudo-labeled real data from a scaled teacher.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a scaled teacher solely on synthetic images and then using the teacher to label large numbers of real images, the resulting student models produce significantly finer and more robust depth predictions than Depth Anything V1 while remaining far more efficient than diffusion-based depth estimators.
What carries the argument
A teacher-student distillation pipeline in which a large teacher trained on synthetic images generates pseudo-labels that bridge to the training of smaller student models on real photographs.
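The three stages of that pipeline can be sketched with a deliberately tiny stand-in. This is a minimal illustration, not the paper's implementation: a closed-form 1-D line fit replaces the depth networks, teacher and student have the same capacity (unlike the scaled teacher described above), and every name (`fit_line`, `predict`, `distill`, the toy data) is hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var
    return a, mean_y - a * mean_x

def predict(model, xs):
    a, b = model
    return [a * x + b for x in xs]

def distill(synthetic_x, synthetic_y, real_x):
    # Stage 1: the teacher only ever sees synthetic inputs with exact labels.
    teacher = fit_line(synthetic_x, synthetic_y)
    # Stage 2: the teacher labels real inputs that have no ground truth.
    pseudo_y = predict(teacher, real_x)
    # Stage 3: the student is trained on real inputs and pseudo-labels only.
    student = fit_line(real_x, pseudo_y)
    return teacher, student

rng = random.Random(0)
synthetic_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
synthetic_y = [2.0 * x + 0.5 for x in synthetic_x]     # noise-free synthetic labels
real_x = [rng.uniform(0.2, 1.5) for _ in range(1000)]  # shifted "real" input range
teacher, student = distill(synthetic_x, synthetic_y, real_x)
```

The point of the sketch is the data flow: the student never touches a synthetic sample or a real ground-truth label, so anything it learns about real inputs arrives through the teacher's pseudo-labels.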
If this is right
- Models from 25M to 1.3B parameters support applications with different speed and accuracy needs.
- Fine-tuning the same backbone on metric depth labels yields accurate absolute-depth outputs.
- Inference speed exceeds that of Stable Diffusion depth models by more than a factor of ten.
- A new benchmark with precise ground truth and broad scene coverage replaces older limited test sets.
Where Pith is reading between the lines
- High-quality synthetic data can substitute for scarce real labels in dense prediction tasks.
- Teacher capacity scaling appears decisive for producing pseudo-labels that transfer reliably.
- The same synthetic-to-pseudo-label route may improve related tasks such as normal estimation or optical flow.
Load-bearing premise
Synthetic images plus pseudo-labels from a scaled teacher will generalize to diverse real scenes without introducing systematic biases that hurt performance.
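One way to probe this premise is to measure the signed error of the teacher's pseudo-labels separately for each scene category, on whatever real images do have trusted depth. The sketch below assumes pseudo-labels and ground truth are already expressed in the same units; the category names, the 5% tolerance, and the sample values are invented for illustration and are not measurements from the paper.

```python
from collections import defaultdict

def signed_bias_by_category(samples):
    """samples: iterable of (category, pseudo_depth, true_depth) triples.

    Returns {category: mean of (pseudo - true) / true}. A value far from
    zero means the pseudo-labels are consistently too deep or too shallow
    for that category, rather than merely noisy.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, pseudo, true in samples:
        totals[category] += (pseudo - true) / true
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in totals}

def flag_biased(bias, tolerance=0.05):
    """Categories whose mean signed relative error exceeds the tolerance."""
    return sorted(c for c, b in bias.items() if abs(b) > tolerance)

# Toy inputs only: errors on "matte_wall" cancel out, errors on "mirror" do not.
samples = [
    ("matte_wall", 2.02, 2.00),
    ("matte_wall", 2.97, 3.00),
    ("mirror", 4.50, 3.00),
    ("mirror", 2.80, 2.00),
]
bias = signed_bias_by_category(samples)
flagged = flag_biased(bias)
```

Using a signed rather than absolute error is the design choice that matters here: random noise averages toward zero, while a systematic bias that a student could inherit does not.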
What would settle it
If a model trained on real labeled data recorded lower error than V2 on a controlled test set of real images from previously unseen scene types, that result would falsify the claim that the synthetic-plus-pseudo-label route is superior.
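Such a comparison needs a concrete error measure. The summary above does not name one, so the sketch below assumes two metrics that are standard in monocular depth evaluation: absolute relative error (AbsRel, lower is better) and the δ1 accuracy at threshold 1.25 (higher is better). The function names and the toy depth values are hypothetical and stand in for per-pixel predictions from two models on the same test set.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over all valid pixels (lower is better)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold (higher is better)."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

def a_beats_b(gt, pred_a, pred_b):
    """True if model A has strictly lower AbsRel than model B on this set."""
    return abs_rel(pred_a, gt) < abs_rel(pred_b, gt)

# Toy depths in metres, flattened to a list; not results from any real model.
gt = [1.0, 2.0, 4.0]
pred_a = [1.1, 2.0, 4.4]
pred_b = [1.0, 2.0, 6.0]

report = {
    "abs_rel_a": abs_rel(pred_a, gt),
    "abs_rel_b": abs_rel(pred_b, gt),
    "delta1_a": delta1(pred_a, gt),
    "delta1_b": delta1(pred_b, gt),
    "a_beats_b": a_beats_b(gt, pred_a, pred_b),
}
```

For relative-depth models, predictions would first have to be aligned to the ground truth (for example in scale and shift) before these formulas apply; that step is omitted here.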
Original abstract
This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Depth Anything V2, which improves monocular depth estimation over V1 by replacing labeled real images with synthetic data for teacher training, scaling the teacher model capacity, and using the teacher to generate pseudo-labels on large-scale real images for student training. It claims significantly finer and more robust predictions, over 10x faster inference and higher accuracy than Stable Diffusion-based models, provides models from 25M to 1.3B parameters, and introduces a new diverse evaluation benchmark with precise annotations.
Significance. If the empirical claims hold, the work offers a practical route to high-quality, efficient depth models that leverage synthetic data and pseudo-labeling at scale, with clear benefits for real-time applications. The provision of multiple model scales and a new benchmark with diverse scenes and precise annotations would support further research in the field.
major comments (2)
- [Abstract] Abstract: the central claim that the three practices (synthetic-only teacher training, teacher scaling, and pseudo-label bridge) produce finer and more robust predictions than V1 or Stable Diffusion baselines rests on unverified experimental outcomes; no quantitative tables, ablation details isolating each practice, or error analysis on challenging real scenes are referenced.
- [Abstract] Abstract: the generalization advantage requires that the scaled synthetic-trained teacher generates pseudo-labels without systematic bias on real-world phenomena absent from synthetic data (e.g., complex reflections, low-light gradients, fine occlusion boundaries); without explicit tests or analysis addressing this, the student may inherit errors that undermine the reported robustness gains.
minor comments (1)
- [Abstract] Abstract: model parameter counts (25M to 1.3B) are listed but without corresponding speed/accuracy trade-offs or per-scale benchmark numbers to guide practitioners.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with references to the specific experimental results and sections that support our claims.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that the three practices (synthetic-only teacher training, teacher scaling, and pseudo-label bridge) produce finer and more robust predictions than V1 or Stable Diffusion baselines rests on unverified experimental outcomes; no quantitative tables, ablation details isolating each practice, or error analysis on challenging real scenes are referenced.
  Authors: The abstract is a concise summary; the full manuscript contains the supporting experiments. Table 1 reports direct quantitative comparisons against Depth Anything V1 and Stable Diffusion-based models on multiple benchmarks, showing consistent gains in accuracy and >10x inference speed. Section 4.2 presents ablations that isolate each of the three practices (synthetic-only teacher training, teacher scaling, and pseudo-label bridge) with corresponding metrics. Section 5.3 provides both quantitative and qualitative error analysis on challenging real scenes, including fine structures and robustness under varying conditions. We will revise the abstract to explicitly reference these tables and sections.
  Revision: partial
- Referee: [Abstract] Abstract: the generalization advantage requires that the scaled synthetic-trained teacher generates pseudo-labels without systematic bias on real-world phenomena absent from synthetic data (e.g., complex reflections, low-light gradients, fine occlusion boundaries); without explicit tests or analysis addressing this, the student may inherit errors that undermine the reported robustness gains.
  Authors: Our evaluation protocol directly tests generalization on real-world data containing the cited phenomena. The new benchmark introduced in Section 6 comprises diverse scenes with precise annotations that explicitly include complex reflections, low-light gradients, and fine occlusion boundaries. Tables 3 and 4 report that models trained via the pseudo-label bridge outperform both V1 and Stable Diffusion baselines on these subsets, with no evidence of systematic error inheritance. The large-scale real-image pseudo-labeling step is designed to adapt the student to real distributions, and the reported robustness improvements are measured on exactly these challenging cases.
  Revision: no
Circularity Check
No significant circularity in empirical training pipeline
Full rationale
The paper reports an empirical training recipe for monocular depth estimation: synthetic images replace real labeled data for the teacher, the teacher is scaled, and students are trained on its pseudo-labels on real images. These are design choices whose validity is assessed by external benchmarking on diverse test sets rather than any derivation that reduces to its own inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force the result; the central claims rest on measured accuracy and efficiency gains against independent baselines.
Axiom & Free-Parameter Ledger
free parameters (2)
- teacher model capacity
- pseudo-label generation scale
Forward citations
Cited by 28 Pith papers
- Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution
  The paper provides the first theoretical analysis of multi-modal super-resolution and proposes M³ESR, a mixture-of-experts framework with spatially dynamic and temporally adaptive modality weighting that improves gene...
- DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
  Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
- Triangulation of Points Constrained to a Plane
  A closed-form formula is derived for the number of complex critical points in the planar triangulation problem, valid for any number of views.
- Face Anything: 4D Face Reconstruction from Any Image Sequence
  A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.
- Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
  The paper presents a multimodal framework, dataset, and reconstruction pipeline to create immersive volumetric videos supporting large 6-DoF audiovisual interaction from real multi-view captures.
- KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
  KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
- Training a Student Expert via Semi-Supervised Foundation Model Distillation
  A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
- Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
  Sat3DGen improves geometric RMSE from 6.76m to 5.20m and FID from ~40 to 19 for street-level 3D generation from satellite images via geometry-centric constraints and perspective training.
- DegBins: Degradation-Driven Binning for Depth Super-Resolution
  DegBins uses degradation-driven binning and multi-stage refinement to turn residual depth regression into a more flexible hybrid classification-regression problem that outperforms prior depth super-resolution methods ...
- Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
  Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.
- MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
  MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
- LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
  LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
- SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo
  Procedural rules with NURBS generate MVS training data that outperforms same-scale manual curation and matches or exceeds larger manual datasets.
- DINO-VO: Learning Where to Focus for Enhanced State Estimation
  DINO-VO achieves state-of-the-art monocular visual odometry accuracy and generalization by training a differentiable patch selector together with multi-task features and inverse-depth bundle adjustment.
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
  Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting
  ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.
- Why Domain Matters: A Preliminary Study of Domain Effects in Underwater Object Detection
  A framework labels underwater images by physical characteristics to group them semantically and evaluate object detection performance across real domain factors.
- SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation
  SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some s...
- Physics-Informed Neural Optimal Control for Precision Immobilization Technique in Emergency Scenarios
  A distilled physics-informed neural surrogate in a hierarchical optimal control architecture raises simulated PIT success from 63.8% to 76.7% and succeeds in three of four low-speed scaled-vehicle tests.
- Qwen-Image Technical Report
  Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
- MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
  MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
  Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.
- Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement
  A three-stage progressive refinement model guided by DINOv2 semantics and geometric depth/normals cues won the NTIRE 2026 image shadow removal challenge with top scores of 26.68 PSNR and 0.874 SSIM.
- Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
  The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
- NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
  The NTIRE 2026 challenge reports measurable progress in 3D reconstruction pipelines that handle real-world low-light and smoke degradation via the RealX3D benchmark.
Reference graph
Works this paper leans on
-
[1]
Mapillary planet-scale depth dataset
Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulò, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020. 12
work page 2020
-
[2]
Do deep nets really need to be deep? In NeurIPS, 2014
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NeurIPS, 2014. 10
work page 2014
-
[3]
Probing the 3d awareness of visual foundation models
Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In CVPR, 2024. 8, 14
work page 2024
-
[4]
Beit: Bert pre-training of image transformers
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. In ICLR, 2022. 2, 5, 12, 14
work page 2022
-
[5]
Adabins: Depth estimation using adaptive bins
Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, 2021. 9, 10
work page 2021
-
[6]
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288, 2023. 2, 9, 20
work page internal anchor Pith review arXiv 2023
-
[7]
1–a model zoo for robust monocular relative depth estimation
Reiner Birkl, Diana Wofk, and Matthias Müller. Midas v3. 1–a model zoo for robust monocular relative depth estimation. arXiv:2307.14460, 2023. 2, 3, 5, 8, 9, 10, 13, 16
-
[8]
A naturalistic open source movie for optical flow evaluation
Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012. 8, 9, 10, 12, 13, 14
work page 2012
-
[9]
Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv:2001.10773,
work page internal anchor Pith review arXiv 2001
-
[10]
Learning lightweight object detectors via multi-teacher progressive distillation
Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, and Liangyan Gui. Learning lightweight object detectors via multi-teacher progressive distillation. In ICML,
-
[11]
Single-image depth perception in the wild
Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NeurIPS, 2016. 7, 8, 16
work page 2016
-
[12]
Vision transformer adapter for dense predictions
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2023. 12
work page 2023
-
[13]
Masked-attention mask transformer for universal image segmentation
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022. 12
work page 2022
-
[14]
Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes
Jaehoon Cho, Dongbo Min, Youngjung Kim, and Kwanghoon Sohn. Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes. arXiv:2110.11590, 2021. 4, 10
-
[15]
Learning depth estimation for transparent and mirror surfaces
Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Learning depth estimation for transparent and mirror surfaces. In ICCV,
-
[16]
Vision transformers need registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024. 14
work page 2024
-
[17]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR,
-
[18]
Depth map prediction from a single image using a multi-scale deep network
David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014. 10
work page 2014
-
[19]
Deep ordinal regression network for monocular depth estimation
Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018. 10
work page 2018
-
[20]
Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image
Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. arXiv:2403.12013, 2024. 2, 4, 7, 8, 9, 10 25
-
[21]
Blink: Multimodal large language models can see but not perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv:2404.12390, 2024. 8
-
[22]
Unsupervised domain adaptation by backpropagation
Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015. 10
work page 2015
-
[23]
Domain-adversarial training of neural networks
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016. 10
work page 2016
-
[24]
Vision meets robotics: The kitti dataset
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 2013. 7, 8, 9, 10, 12, 13, 14
work page 2013
-
[25]
Depthfm: Fast monocular depth estimation with flow matching
Ming Gui, Johannes S Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Depthfm: Fast monocular depth estimation with flow matching. arXiv:2403.13788, 2024. 1, 2, 4, 8, 10
-
[26]
Towards zero-shot scale-aware monocular depth estimation
Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rare s, Ambrus, , and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In ICCV, 2023. 2, 3
work page 2023
-
[27]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015. 6, 10
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[28]
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv:2404.15506, 2024. 2, 3, 5, 8
-
[29]
One- former: One transformer to rule universal image segmentation
Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. One- former: One transformer to rule universal image segmentation. In CVPR, 2023. 12
work page 2023
-
[30]
Ddp: Diffusion model for dense visual prediction
Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. Ddp: Diffusion model for dense visual prediction. In ICCV, 2023. 12
work page 2023
-
[31]
Repurposing diffusion-based image generators for monocular depth estimation
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Kon- rad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, 2024. 1, 2, 4, 7, 8, 9, 10, 16, 19
work page 2024
-
[32]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023. 2
work page 2023
-
[33]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023. 5, 7, 9, 12, 13, 14, 22
work page 2023
-
[34]
Learning multiple layers of features from tiny images
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images
-
[35]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 12, 22
work page 2020
-
[36]
Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICMLW, 2013. 10
work page 2013
-
[37]
Megadepth: Learning single-view depth prediction from internet photos
Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 3, 4
work page 2018
-
[38]
Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In CVPR, 2024. 2
work page 2024
-
[39]
Magicedit: High-fidelity and temporally coherent video editing,
Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, and Jiashi Feng. Magicedit: High-fidelity and temporally coherent video editing. arXiv:2308.14749, 2023. 2, 17 26
-
[40]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV,
-
[41]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 8
work page 2023
-
[42]
Curvefusion: reconstructing thin structures from rgbd sequences
Lingjie Liu, Nenglun Chen, Duygu Ceylan, Christian Theobalt, Wenping Wang, and Niloy J Mitra. Curvefusion: reconstructing thin structures from rgbd sequences. TOG, 2018. 2
work page 2018
-
[43]
Structured knowledge distillation for semantic segmentation
Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In CVPR, 2019. 10
work page 2019
-
[44]
Swin transformer v2: Scaling up capacity and resolution
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 9
work page 2022
-
[45]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV,
-
[46]
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022. 12
work page 2022
-
[47]
Srinivasan, Matthew Tancik, Jonathan T
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV,
-
[48]
Improved knowledge distillation via teacher assistant
Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI, 2020. 6, 10
work page 2020
-
[49]
All in tokens: Unifying output space of visual tasks via soft token
Jia Ning, Chen Li, Zheng Zhang, Chunyu Wang, Zigang Geng, Qi Dai, Kun He, and Han Hu. All in tokens: Unifying output space of visual tasks via soft token. In ICCV, 2023. 9
work page 2023
-
[50]
Dinov2: Learning robust visual features without supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2023. 2, 5, 13, 14
work page 2023
-
[51]
P3depth: Monocular depth estimation with a piecewise planarity prior
Vaishakh Patil, Christos Sakaridis, Alexander Liniger, and Luc Van Gool. P3depth: Monocular depth estimation with a piecewise planarity prior. In CVPR, 2022. 9
work page 2022
-
[52]
Unidepth: Universal monocular metric depth estimation
Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In CVPR, 2024. 2
work page 2024
-
[53]
Unrealcv: Virtual worlds for computer vision
Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, and Yizhou Wang. Unrealcv: Virtual worlds for computer vision. In ACM MM, 2017. 4
work page 2017
-
[54]
Open challenges in deep stereo: the booster dataset
Pierluigi Zama Ramirez, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Open challenges in deep stereo: the booster dataset. In CVPR, 2022. 4, 13
work page 2022
-
[55]
Vision transformers for dense predic- tion
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense predic- tion. In ICCV, 2021. 8, 9, 10
work page 2021
-
[56]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2022. 2, 3, 6, 10, 13, 14
work page 2022
-
[57]
Playing for data: Ground truth from computer games
Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016. 4
work page 2016
-
[58]
Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding
Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021. 4, 9, 12, 16 27
work page 2021
-
[59]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 10
work page 2022
-
[60]
Imagenet large scale visual recognition challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015. 12, 22
work page 2015
-
[61]
Learning from synthetic data: Addressing domain shift for semantic segmentation
Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR,
-
[62]
A multi-view stereo benchmark with high-resolution images and multi-camera videos
Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017. 8, 9, 10, 12, 13, 14
work page 2017
-
[63]
Airsim: High-fidelity visual and physical simulation for autonomous vehicles
Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017. 4
work page 2017
-
[64]
Inserf: Text-driven generative object insertion in neural 3d scenes
Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, and Federico Tombari. Inserf: Text-driven generative object insertion in neural 3d scenes. arXiv:2401.05335, 2024. 2
-
[65]
Objects365: A large-scale, high-quality dataset for object detection
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 12, 22
work page 2019
-
[66]
Nddepth: Normal-distance assisted monocular depth estimation
Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, and Zhengguo Li. Nddepth: Normal-distance assisted monocular depth estimation. In ICCV, 2023. 9
work page 2023
-
[67]
Iebins: Iterative elastic bins for monocular depth estimation
Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, and Zhengguo Li. Iebins: Iterative elastic bins for monocular depth estimation. In NeurIPS, 2023. 9
work page 2023
-
[68]
Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion
Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion. arXiv:2404.07199, 2024. 2
-
[69]
Channel-wise knowledge distillation for dense prediction
Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. Channel-wise knowledge distillation for dense prediction. In ICCV, 2021. 10
work page 2021
-
[70]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. 3, 7, 8, 9, 10, 12, 13, 14, 16
work page 2012
-
[71]
Fixmatch: Simplifying semi-supervised learning with consistency and confidence
Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020. 10
work page 2020
-
[72]
The third monocular depth estimation challenge
Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao, et al. The third monocular depth estimation challenge. arXiv:2404.16831, 2024. 2, 6
-
[73]
Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In NeurIPS, 2015. 10
work page 2015
-
[74]
Segmenter: Transformer for semantic segmentation
Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021. 12
work page 2021
-
[75]
Learning vision from models rivals learning vision from data
Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. In CVPR, 2024. 5, 14
work page 2024
-
[76]
Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z Dai, Andrea F Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R Walter, et al. Diode: A dense indoor and outdoor depth dataset. arXiv:1908.00463, 2019. 8, 9, 10, 12, 13, 14
-
[77]
Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In ICME, 2021. 12
work page 2021
-
[78]
Internimage: Exploring large-scale vision foundation models with deformable convolutions
Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023. 12
work page 2023
-
[79]
Tartanair: A dataset to push the limits of visual slam
Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IROS, 2020. 12
work page 2020
-
[80]
Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In CVPR, 2019. 2
work page 2019