arxiv: 2605.23281 · v1 · pith:CTG7RUAHnew · submitted 2026-05-22 · 💻 cs.CV

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

Pith reviewed 2026-05-25 04:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · expert selection · vision-language agent · model fusion · camera geometry · reinforcement learning · universal depth estimation · sample-wise complementarity

The pith

DepthAgent uses a vision-language agent to select and fuse depth experts sample by sample for improved monocular depth across camera types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that depth estimation models have complementary strengths strongly tied to camera geometry, with the largest benefits from fusion appearing on difficult samples. A vision-language agent can exploit this by analyzing scene and camera cues, calling depth models as tools in multiple turns, and choosing or combining outputs. This is optimized through multi-reward reinforcement fine-tuning that rewards valid tool use, cue analysis, selection quality, and efficiency. If the approach holds, it offers a route to more robust universal depth estimation without retraining the underlying experts, delivering gains precisely where single models are weakest.

Core claim

Depth experts exhibit strong sample-wise complementarity correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples. DepthAgent treats existing depth models as frozen tools, learns to analyze scene and camera cues, invokes suitable experts through multi-turn tool utilization, and selects or fuses their predictions for each input, with the decisions optimized by a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency.
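
The complementarity claim can be checked per sample with the δ1 accuracy that the paper's figures report. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes the conventional δ1 definition (fraction of pixels with max(pred/gt, gt/pred) < 1.25), represents depth maps as flat lists, and uses hypothetical helper names (`delta1`, `fusion_gain`, `gain_by_quintile`). The toy arrays are invented for illustration and are not results from the paper.

```python
from statistics import mean


def delta1(pred, gt, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < thresh."""
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        return 0.0
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < thresh)
    return hits / len(pairs)


def fusion_gain(expert_preds, fused_pred, gt):
    """Return (best single-expert delta1, fused delta1 minus that best)."""
    best_single = max(delta1(p, gt) for p in expert_preds)
    return best_single, delta1(fused_pred, gt) - best_single


def gain_by_quintile(records, n_bins=5):
    """Mean gain per difficulty bin.

    records: list of (best_single_delta1, gain) pairs, one per sample.
    Samples are sorted by best-single delta1, so bin 0 (Q1) is the hardest.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    means = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(mean(g for _, g in chunk) if chunk else None)
    return means


# Toy example: two experts that are each right on a different half of the pixels.
gt = [1.0, 2.0, 4.0, 8.0]
expert_a = [1.0, 2.0, 8.0, 16.0]
expert_b = [2.0, 4.0, 4.0, 8.0]
fused = [1.0, 2.0, 4.0, 8.0]  # hypothetical ideal pixel-wise fusion
best, gain = fusion_gain([expert_a, expert_b], fused, gt)
```

Sorting by best-single δ1 before binning mirrors the Q1 (hardest) to Q5 (easiest) convention described in the Figure 3 caption; a positive mean gain concentrated in the first bins would be consistent with the claim.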

What carries the argument

DepthAgent, a vision-language agent that performs multi-turn tool calls on frozen depth estimators after extracting camera and scene cues to decide on selection or fusion.
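
A rough sketch of the control flow this describes, under loud assumptions: the `policy` callable is a hypothetical stand-in for the vision-language agent, the experts are plain callables, and pixel-wise median fusion is an illustrative choice rather than the fusion rule the paper uses. Depth maps are flat lists to keep the example dependency-free.

```python
from statistics import median


def fuse_median(preds):
    """Pixel-wise median of several depth maps (flat lists of equal length)."""
    return [median(px) for px in zip(*preds)]


def run_agent(image, experts, policy, max_turns=3):
    """Multi-turn loop over frozen depth experts.

    experts: dict mapping name -> callable(image) -> depth map (frozen tools)
    policy:  callable(image, results) -> expert name, or None to stop
             (hypothetical placeholder for the agent's decision)
    """
    results = {}
    for _ in range(max_turns):
        choice = policy(image, results)
        if choice is None or choice not in experts or choice in results:
            break  # explicit stop, or an invalid / repeated tool call
        results[choice] = experts[choice](image)
    if not results:
        raise ValueError("policy made no valid tool call")
    preds = list(results.values())
    # One result means selection; several results are fused.
    return preds[0] if len(preds) == 1 else fuse_median(preds)


# Toy experts and policies, invented for illustration.
experts = {
    "persp": lambda img: [1.0, 2.0],
    "fish": lambda img: [3.0, 4.0],
}


def select_by_camera(image, results):
    """Call the fisheye expert once for fisheye inputs, then stop."""
    if image["camera"] == "fisheye" and "fish" not in results:
        return "fish"
    return None


def call_everything(image, results):
    """Call each expert once, then stop."""
    remaining = [name for name in experts if name not in results]
    return remaining[0] if remaining else None


selected = run_agent({"camera": "fisheye"}, experts, select_by_camera)
fused = run_agent({"camera": "fisheye"}, experts, call_everything)
```

Keeping the experts behind a name-to-callable mapping reflects the "frozen tools" framing: the loop never touches expert weights, only which tools are called and how their outputs are combined.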

If this is right

  • DepthAgent outperforms individual experts, fixed model fusion, and alternative selection strategies on perspective, fisheye, and panoramic benchmarks.
  • The largest accuracy gains occur on challenging samples where individual experts are unreliable.
  • The multi-reward reinforcement scheme produces decisions that balance analysis quality, selection accuracy, tool validity, and computational cost.
  • The method works by keeping all depth models frozen and routing each input to one or more of them.
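
One way to picture the multi-reward balance mentioned above is a weighted sum over the four named terms. This is a sketch under stated assumptions only: the weighted-sum form, the unit weights, the use of δ1 as the selection-quality signal, and the linear efficiency penalty are all illustrative choices, and `RewardWeights` and `total_reward` are hypothetical names. The paper's actual reward definitions and weights are not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Illustrative values; the paper's weights are not reported here.
    tool: float = 1.0
    analysis: float = 1.0
    selection: float = 1.0
    efficiency: float = 1.0


def total_reward(valid_tool_calls, analysis_correct, selection_score,
                 n_calls, max_calls, weights=None):
    """Combine four reward terms into one scalar.

    valid_tool_calls: True if every tool call was well-formed
    analysis_correct: True if the camera/scene analysis matched the label
    selection_score:  depth quality of the final output in [0, 1] (assumed: delta1)
    n_calls, max_calls: tool calls used and the allowed budget (max_calls > 0)
    """
    w = weights or RewardWeights()
    r_tool = 1.0 if valid_tool_calls else 0.0
    r_analysis = 1.0 if analysis_correct else 0.0
    r_efficiency = 1.0 - n_calls / max_calls  # fewer calls, higher reward
    return (w.tool * r_tool + w.analysis * r_analysis
            + w.selection * selection_score + w.efficiency * r_efficiency)


example = total_reward(True, True, 0.5, n_calls=1, max_calls=4)
no_analysis = total_reward(True, True, 0.5, 1, 4,
                           weights=RewardWeights(analysis=0.0))
```

Setting a single weight to zero, as in `no_analysis`, is the kind of switch a reward ablation would flip to test whether one term matters.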

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cue-driven routing could be tested on other dense prediction tasks that already have multiple specialized models, such as surface normal estimation or semantic segmentation.
  • The emphasis on explicit camera-geometry analysis may transfer to selection strategies in other geometric vision problems that suffer from domain shift.
  • If the agent overhead remains low, the efficiency term in the reward could make the method viable for online systems that must choose models on the fly.

Load-bearing premise

The vision-language model can reliably extract camera geometry and scene cues to drive valid multi-turn expert selection and fusion decisions that improve dense depth quality.

What would settle it

Replace the agent's selections with random expert choices on a benchmark containing mixed perspective, fisheye, and panoramic images; if the resulting depth accuracy shows no drop relative to the learned policy and no loss of gains on hard samples, the adaptive mechanism adds no value.
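
The control described here can be sketched as a small harness that scores any selection rule on the same samples. Everything below is an assumption made for illustration: the toy experts read the ground truth to simulate camera-dependent accuracy, `camera_policy` is a hypothetical stand-in for the learned agent, and δ1 uses the conventional 1.25 threshold. No numbers here come from the paper.

```python
import random
from statistics import mean


def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh."""
    hits = [max(p / g, g / p) < thresh for p, g in zip(pred, gt)]
    return sum(hits) / len(hits)


def evaluate(selector, samples, experts):
    """Mean delta1 when selector(sample) names the expert to use."""
    return mean(delta1(experts[selector(s)](s), s["gt"]) for s in samples)


def random_selector(names, seed=0):
    """Control policy: pick an expert uniformly at random, ignoring the input."""
    rng = random.Random(seed)
    return lambda sample: rng.choice(names)


# Toy experts: each is exact on its own camera type and off by 2x otherwise.
def persp_expert(s):
    return s["gt"] if s["camera"] == "perspective" else [2 * g for g in s["gt"]]


def fish_expert(s):
    return s["gt"] if s["camera"] == "fisheye" else [2 * g for g in s["gt"]]


experts = {"persp": persp_expert, "fish": fish_expert}
samples = [
    {"camera": "perspective", "gt": [1.0, 2.0]},
    {"camera": "fisheye", "gt": [3.0, 4.0]},
]


def camera_policy(sample):
    """Stand-in for the learned agent: route by camera type."""
    return "persp" if sample["camera"] == "perspective" else "fish"


learned_score = evaluate(camera_policy, samples, experts)
random_score = evaluate(random_selector(list(experts)), samples, experts)
```

In a real run the comparison would be repeated over several random seeds and reported separately for the hard-sample subset, since the claim being tested is specifically about gains where single experts are weakest.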

Figures

Figures reproduced from arXiv: 2605.23281 by Girish Chandar Ganesan, Jie Zhu, Xiaoming Liu.

Figure 1
Motivation of DepthAgent. Real-world inputs span heterogeneous camera domains, including perspective, fisheye, and panoramic images, for which different depth experts exhibit different strengths. Instead of relying on a single model for all inputs, DepthAgent uses scene and camera cues to select suitable expert(s) on a per-sample basis, producing more reliable depth maps across diverse camera settings.
Figure 2
Fusion consistently outperforms single models. Left: Dataset-wise oracle proportions of single-model vs. multi-model solutions. Right: Fusion gain against the best single model. Bars and diamonds denote the mean and 90th percentile of per-sample δ1 fusion gain over the best single model; annotations show the fraction of samples where fusion achieves higher δ1.
Figure 3
Difficulty-dependent fusion gain. Mean fusion gain (Δδ1 ± σ) is shown across best-single δ1 quintiles within each dataset group, where Q1/Q5 denote the hardest/easiest samples. Fusion gains are largest on hard samples (the strongest individual model performs poorly). Dashed lines indicate linear trends over quintile means, while Pearson r is computed over pooled per-sample pairs.
Figure 4
Overview of DepthAgent. Given an input image, DepthAgent analyzes the scene type and camera intrinsics, and interactively selects depth experts from a depth-expert tool pool. Each tool call returns a depth prediction along with auxiliary depth features. Based on the tool results, DepthAgent determines whether to continue exploration or produce the final solution, which includes the final depth map generate…
Figure 5
Analysis of DepthAgent behavior. From §4.3 (Qualitative Results), depth-map comparison: Fig. 5a compares DepthAgent with the top-2 individual experts on representative samples. DepthAgent produces more faithful depth structures and consistently lower error maps by selecting or combining complementary expert solutions. Additional visualizations are provided in the Appendix.
Original abstract

Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose DepthAgent, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that DepthAgent consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces DepthAgent, a vision-language model agent that treats existing monocular depth estimators as frozen tools. It performs multi-turn reasoning to extract camera geometry and scene cues, invoke suitable experts, and select or fuse their outputs per sample. A multi-reward reinforcement fine-tuning scheme optimizes tool execution, cue analysis, selection quality, and efficiency. Experiments on perspective, fisheye, and panoramic benchmarks claim consistent outperformance over single experts, fixed fusions, and alternative selection strategies, with largest gains on challenging samples where individual models fail.

Significance. If the central mechanism is validated, the work would demonstrate sample-wise complementarity among depth experts tied to camera geometry and show that agentic, cue-driven selection can outperform static ensembles for universal depth estimation. The planned code and model release would support reproducibility. The empirical focus on difficult samples and multi-camera settings addresses a practical deployment gap.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (motivation and method overview): the claim that 'model preference is highly correlated with camera geometry' and that the agent 'learns to analyze these cues' is load-bearing for the central contribution, yet no quantitative validation is reported (e.g., accuracy of predicted intrinsics, camera type classification, or scene descriptors against ground truth). End-to-end depth metrics alone cannot isolate whether gains arise from informed cue-driven decisions or from generic averaging/reward shaping.
  2. [§4] §4 (experiments) and associated tables: while outperformance on challenging samples is asserted, the manuscript provides no ablation that holds the fusion mechanics fixed while varying cue quality (or vice versa), nor controls that compare against post-hoc selection without the VLM agent. This leaves the mechanism unconfirmed relative to the skeptic's concern.
  3. [§3.2] §3.2 (multi-reward RL scheme): the multi-reward formulation includes weights on tool execution, cue analysis, selection quality, and efficiency; without an ablation on these weights or a demonstration that cue-analysis reward is necessary for the reported gains, it remains unclear whether the VLM's scene/camera reasoning is causally responsible for the improvements.
minor comments (3)
  1. [Abstract] Abstract states 'consistent outperformance' but supplies no numerical deltas, dataset names, or error metrics; moving at least headline numbers into the abstract would improve readability.
  2. [§3] Notation for the multi-turn tool calls and reward components could be clarified with a compact table or pseudocode listing the exact action space and reward terms.
  3. [Figures in §4] Figure captions and axis labels in the result figures should explicitly state the evaluation metric (e.g., AbsRel, RMSE) and whether lower or higher is better.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the validation of the central claims.

  1. Referee: [Abstract, §3] Abstract and §3 (motivation and method overview): the claim that 'model preference is highly correlated with camera geometry' and that the agent 'learns to analyze these cues' is load-bearing for the central contribution, yet no quantitative validation is reported (e.g., accuracy of predicted intrinsics, camera type classification, or scene descriptors against ground truth). End-to-end depth metrics alone cannot isolate whether gains arise from informed cue-driven decisions or from generic averaging/reward shaping.

    Authors: We agree that quantitative validation of cue analysis is necessary to isolate the mechanism. In the revised manuscript we will add a dedicated evaluation subsection reporting the agent's accuracy on camera intrinsics prediction, camera-type classification, and scene descriptor extraction against ground-truth annotations on a held-out subset. These metrics will be presented alongside the depth results to demonstrate that performance gains arise from informed cue-driven decisions. revision: yes

  2. Referee: [§4] §4 (experiments) and associated tables: while outperformance on challenging samples is asserted, the manuscript provides no ablation that holds the fusion mechanics fixed while varying cue quality (or vice versa), nor controls that compare against post-hoc selection without the VLM agent. This leaves the mechanism unconfirmed relative to the skeptic's concern.

    Authors: We acknowledge the absence of these targeted controls. The revision will include two new ablations: (1) fusion mechanics held fixed while cue quality is varied (oracle cues versus agent-generated cues), and (2) direct comparison of the full VLM agent against post-hoc selection baselines that lack the agent's multi-turn reasoning. These experiments will be added to §4 and the associated tables. revision: yes

  3. Referee: [§3.2] §3.2 (multi-reward RL scheme): the multi-reward formulation includes weights on tool execution, cue analysis, selection quality, and efficiency; without an ablation on these weights or a demonstration that cue-analysis reward is necessary for the reported gains, it remains unclear whether the VLM's scene/camera reasoning is causally responsible for the improvements.

    Authors: We agree that an ablation isolating the cue-analysis reward component is required. The revised version will report results from ablating the reward weights and from removing or heavily down-weighting the cue-analysis term, demonstrating its necessity for the observed gains. These results will be added to §3.2 and the experimental section. revision: yes

Circularity Check

0 steps flagged

Empirical agent method with no self-referential derivation or fitted predictions

The paper presents an empirical system (DepthAgent) that trains a VLM-based agent via multi-reward RL to select/fuse frozen depth experts. All reported gains are measured on external benchmarks (perspective, fisheye, panoramic) against baselines; no equations, uniqueness theorems, or performance metrics are defined in terms of the method's own fitted parameters or prior self-citations. The observed complementarity is an experimental finding, not a constructed identity. This is a standard empirical ML paper whose central claims rest on held-out test metrics rather than internal redefinition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption of strong sample-wise complementarity between depth experts and on several design choices in the RL reward scheme whose values are not reported.

free parameters (1)
  • multi-reward weights
    Weights balancing tool execution, analysis quality, depth accuracy, and efficiency are introduced in the RL fine-tuning scheme.
axioms (1)
  • domain assumption: Depth experts exhibit strong sample-wise complementarity correlated with camera geometry
    Stated as an observation motivating the agent design.
invented entities (1)
  • DepthAgent (no independent evidence)
    purpose: Vision-language agent for adaptive expert selection and fusion
    New system introduced to perform the selection task.

pith-pipeline@v0.9.0 · 5773 in / 1220 out tokens · 28217 ms · 2026-05-25T04:26:58.064363+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 20 internal anchors

  1. [1]

    Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions

    Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, 2023

  2. [2]

    Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation

    Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Vasileios Gkitsas, Vladimiros Sterzentsenko, Federico Alvarez, Dimitrios Zarpalas, and Petros Daras. Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3727–3737, 2021

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  5. [5]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  6. [6]

    Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents

    Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  7. [7]

    Atm: Action temporality modeling for video question answering

    Junwen Chen, Jie Zhu, and Yu Kong. Atm: Action temporality modeling for video question answering. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4886–4895, 2023

  8. [8]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

  9. [9]

    On the suitability of reinforcement fine-tuning to visual tasks

    Xiaxu Chen, Wei Li, Chunxu Liu, Chi Xie, Xiaoyan Hu, Chengqian Ma, Feng Zhu, and Rui Zhao. On the suitability of reinforcement fine-tuning to visual tasks. InCVPR, 2025

  10. [10]

    Multi-resolution monocular depth map fusion by self-supervised gradient-based composition

    Yaqiao Dai, Renjiao Yi, Chenyang Zhu, Hongjun He, and Kai Xu. Multi-resolution monocular depth map fusion by self-supervised gradient-based composition. InProceedings of the AAAI Conference on Artificial Intelligence, 2023

  11. [11]

    Towards real-time monocular depth estimation for robotics: A survey.IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022

    Xingshuai Dong, Matthew A Garratt, Sreenatha G Anavatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey.IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022

  12. [12]

    Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality

    Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality. InProceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020

  13. [13]

    Simfir: A simple framework for fisheye image rectification with self-supervised representation learning

    Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, and Houqiang Li. Simfir: A simple framework for fisheye image rectification with self-supervised representation learning. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023

  14. [14]

    Unidac: Universal metric depth estimation for any camera

    Girish Chandar Ganesan, Yuliang Guo, Liu Ren, and Xiaoming Liu. Unidac: Universal metric depth estimation for any camera. InCVPR, 2026

  15. [15]

    Are we ready for autonomous driving? The KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012

  16. [16]

    A2d2: Audi autonomous driving dataset.arXiv preprint arXiv:2004.06320, 2020

    Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2d2: Audi autonomous driving dataset.arXiv preprint arXiv:2004.06320, 2020

  17. [17]

    3d packing for self- supervised monocular depth estimation

    Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self- supervised monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  18. [18]

    Towards zero-shot scale-aware monocular depth estimation

    Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rare s, Ambrus, , and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9233–9243, 2023. 10

  19. [19]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  20. [20]

    On the holistic approach for detecting human image forgery.arXiv preprint arXiv:2601.04715, 2026

    Xiao Guo, Jie Zhu, Anil Jain, and Xiaoming Liu. On the holistic approach for detecting human image forgery.arXiv preprint arXiv:2601.04715, 2026

  21. [21]

    Depth any camera: Zero-shot metric depth estimation from any camera

    Yuliang Guo, Sparsh Garg, S Mahdi H Miangoleh, Xinyu Huang, and Liu Ren. Depth any camera: Zero-shot metric depth estimation from any camera. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025

  22. [22]

    A survey on deep reinforcement learning algorithms for robotic manipulation.Sensors, 2023

    Dong Han, Beni Mulyana, Vladimir Stankovic, and Samuel Cheng. A survey on deep reinforcement learning algorithms for robotic manipulation.Sensors, 2023

  23. [23]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero- shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024

  24. [24]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  25. [25]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  26. [26]

    Unifuse: Unidirectional fusion for 360 panorama depth estimation.IEEE Robotics and Automation Letters (RA-L), 6(2):1519–1526, 2021

    Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360 panorama depth estimation.IEEE Robotics and Automation Letters (RA-L), 6(2):1519–1526, 2021

  27. [27]

    Unifuse: Unidirectional fusion for 360° panorama depth estimation.IEEE Robotics Autom

    Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360° panorama depth estimation.IEEE Robotics Autom. Lett., 2021

  28. [28]

    Reinforcement learning: A survey

    Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 1996

  29. [29]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  30. [30]

    Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset.Computer Vision and Image Understanding (CVIU), 191:102877, 2020

    Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset.Computer Vision and Image Understanding (CVIU), 191:102877, 2020. doi: 10.1016/j.cviu.2019.102877

  31. [31]

    Deviant: Depth equivariant network for monocular 3d object detection

    Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. Deviant: Depth equivariant network for monocular 3d object detection. InECCV, 2022

  32. [32]

    Charm3r: Towards unseen camera height robust monocular 3d detector

    Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, and Xiaoming Liu. Charm3r: Towards unseen camera height robust monocular 3d detector. InICCV, 2025

  33. [33]

    Simmlm: A simple framework for multi-modal learning with missing modality

    Sijie Li, Chen Chen, and Jungong Han. Simmlm: A simple framework for multi-modal learning with missing modality. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24068–24077, 2025

  34. [34]

    Omnifusion: 360 monocular depth estimation via geometry-aware fusion

    Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. Omnifusion: 360 monocular depth estimation via geometry-aware fusion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022

  35. [35]

    Person recognition at altitude and range: Fusion of face, body shape and gait.arXiv preprint arXiv:2505.04616, 2025

    Feng Liu, Nicholas Chimitt, Lanqing Guo, Jitesh Jain, Aditya Kane, Minchul Kim, Wes Robbins, Yiyang Su, Dingqiang Ye, Xingguang Zhang, et al. Person recognition at altitude and range: Fusion of face, body shape and gait.arXiv preprint arXiv:2505.04616, 2025

  36. [36]

    Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

    Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, and Qifeng Chen. Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

  37. [37]

    GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

    Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization.arXiv preprint arXiv:2601.05242, 2026. 11

  38. [38]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  39. [39]

    Visual-rft: Visual reinforcement fine-tuning.ICCV, 2025

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.ICCV, 2025

  40. [40]

    Routing to the expert: Efficient reward-guided ensemble of large language models

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

  41. [41]

    Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer.IEEE/CAA Journal of Automatica Sinica, 2022

    Jiayi Ma, Linfeng Tang, Fan Fan, Jun Huang, Xiaoguang Mei, and Yong Ma. Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer.IEEE/CAA Journal of Automatica Sinica, 2022

  42. [42]

    Deep learning, reinforcement learning, and world models.Neural Networks, 2022

    Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. Deep learning, reinforcement learning, and world models.Neural Networks, 2022

  43. [43]

    Indoor segmentation and support inference from rgbd images

    Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. InThe European Conference on Computer Vision (ECCV), 2012

  44. [44]

    The fourth monocular depth estimation challenge

    Anton Obukhov, Matteo Poggi, Fabio Tosi, Ripudaman Singh Arora, Jaime Spencer, Chris Russel, Simon Hadfield, Richard Bowden, Shuaihang Wang, Zhenxin Ma, et al. The fourth monocular depth estimation challenge. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

  45. [45]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665, 2024

  46. [46]

    Is pseudo-lidar needed for monocular 3d object detection? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021

    Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021

  47. [47]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

  48. [48]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024

  49. [49]

    Unik3d: Universal camera monocular 3d estimation

    Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unik3d: Universal camera monocular 3d estimation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1028–1039, 2025

  50. [50]

    Unidepthv2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler, 2025

  51. [51]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238, 2021

  52. [52]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  53. [53]

    Adaptive co-teaching for unsupervised monocular depth estimation

    Weisong Ren, Lijun Wang, Yongri Piao, Miao Zhang, Huchuan Lu, and Ting Liu. Adaptive co-teaching for unsupervised monocular depth estimation. In European Conference on Computer Vision, pages 89–105. Springer, 2022

  54. [54]

    360monodepth: High-resolution 360 monocular depth estimation

    Manuel Rey-Area, Mingze Yuan, and Christian Richardt. 360monodepth: High-resolution 360 monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  55. [55]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  56. [56]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  57. [57]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025

  58. [58]

    Panoformer: Panorama transformer for indoor 360° depth estimation

    Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: Panorama transformer for indoor 360° depth estimation. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceed...

  59. [59]

    Learning spherical convolution for fast features from 360° imagery

    Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Dece...

  60. [60]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023

  61. [61]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press Cambridge, 1998

  62. [62]

    Reason-rft: Reinforcement fine-tuning for visual reasoning

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025

  63. [63]

    Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving

    Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8445–8453, 2019

  64. [64]

    Tove: Efficient vision-language learning via knowledge transfer from vision experts

    Yuanchen Wu, Junlong Du, Ke Yan, Shouhong Ding, and Xiaoqiang Li. Tove: Efficient vision-language learning via knowledge transfer from vision experts. arXiv preprint arXiv:2504.00691, 2025

  65. [65]

    Generating and exploiting probabilistic monocular depth estimates

    Zhihao Xia, Patrick Sullivan, and Ayan Chakrabarti. Generating and exploiting probabilistic monocular depth estimates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 65–74, 2020

  66. [66]

    Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications

    Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. 2024

  67. [67]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024

  68. [68]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

  69. [69]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  70. [70]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023

  71. [71]

    Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts

    Hanrong Ye and Dan Xu. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In Proceedings of the IEEE/CVF international conference on computer vision, pages 21828–21837, 2023

  72. [72]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  73. [73]

    Learning to recover 3d scene shape from a single image

    Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021

  74. [74]

    Metric3d: Towards zero-shot metric 3d prediction from a single image

    Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023

  75. [75]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  76. [76]

    Egformer: Equirectangular geometry-biased transformer for 360 depth estimation

    Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae-Eun Rhee. Egformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023

  77. [77]

    Taskonomy: Disentangling task transfer learning

    Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722, 2018

  78. [78]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025

  79. [79]

    Ifcnn: A general image fusion framework based on convolutional neural network

    Yu Zhang, Yu Liu, Peng Sun, Han Yan, Xiaolin Zhao, and Li Zhang. Ifcnn: A general image fusion framework based on convolutional neural network. Information Fusion, 2020

  80. [80]

    Unleashing the power of chain-of-prediction for monocular 3d object detection

    Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, and Xiaoming Liu. Unleashing the power of chain-of-prediction for monocular 3d object detection. In CVPR, 2026
