DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

Girish Chandar Ganesan; Jie Zhu; Xiaoming Liu

arxiv: 2605.23281 · v1 · pith:CTG7RUAHnew · submitted 2026-05-22 · 💻 cs.CV


Pith reviewed 2026-05-25 04:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords: monocular depth estimation, expert selection, vision-language agent, model fusion, camera geometry, reinforcement learning, universal depth estimation, sample-wise complementarity


The pith

DepthAgent uses a vision-language agent to select and fuse depth experts sample by sample for improved monocular depth across camera types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that depth estimation models have complementary strengths strongly tied to camera geometry, with the largest benefits from fusion appearing on difficult samples. A vision-language agent can exploit this by analyzing scene and camera cues, calling depth models as tools in multiple turns, and choosing or combining outputs. This is optimized through multi-reward reinforcement fine-tuning that rewards valid tool use, cue analysis, selection quality, and efficiency. If the approach holds, it offers a route to more robust universal depth estimation without retraining the underlying experts, delivering gains precisely where single models are weakest.

Core claim

Depth experts exhibit strong sample-wise complementarity correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples. DepthAgent treats existing depth models as frozen tools, learns to analyze scene and camera cues, invokes suitable experts through multi-turn tool utilization, and selects or fuses their predictions for each input, with the decisions optimized by a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency.

What carries the argument

DepthAgent, a vision-language agent that performs multi-turn tool calls on frozen depth estimators after extracting camera and scene cues to decide on selection or fusion.
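
As a rough illustration of the control flow described above, the following minimal Python sketch treats each frozen expert as a callable and lets a stand-in policy decide which experts to call before selecting one result or averaging several. Everything in it is hypothetical: the expert names, the `camera_type` cue, the `ROUTING` table, and the per-pixel mean fusion are placeholders for illustration, not the paper's learned VLM policy or its fusion rule.

```python
from typing import Callable, Dict, List

Depth = List[float]  # a flattened depth map in metres (toy representation)

# Hypothetical frozen experts: each maps an image (here a list of floats)
# to a depth map. Real experts would be pretrained networks.
def perspective_expert(image: List[float]) -> Depth:
    return [2.0 * p for p in image]

def panoramic_expert(image: List[float]) -> Depth:
    return [2.0 * p + 1.0 for p in image]

EXPERTS: Dict[str, Callable[[List[float]], Depth]] = {
    "perspective_expert": perspective_expert,
    "panoramic_expert": panoramic_expert,
}

# Hypothetical routing table standing in for the learned policy.
ROUTING: Dict[str, List[str]] = {
    "perspective": ["perspective_expert"],
    "panoramic": ["panoramic_expert"],
    "fisheye": ["perspective_expert", "panoramic_expert"],  # ambiguous: call both
}

def fuse_mean(depths: List[Depth]) -> Depth:
    """Per-pixel mean of several depth maps (a placeholder fusion rule)."""
    return [sum(px) / len(px) for px in zip(*depths)]

def run_agent(image: List[float], camera_type: str, max_turns: int = 2) -> Depth:
    """One tool call per turn, then select (one result) or fuse (several)."""
    results: List[Depth] = []
    for name in ROUTING[camera_type][:max_turns]:
        results.append(EXPERTS[name](image))  # experts are only called, never updated
    return results[0] if len(results) == 1 else fuse_mean(results)

image = [0.5, 1.0, 1.5]
print(run_agent(image, "perspective"))  # a single expert is selected
print(run_agent(image, "fisheye"))      # two experts are called and fused
```

In the paper's setting the routing decision would come from the agent's reading of scene and camera cues rather than a fixed table; the sketch only shows where that decision plugs in.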

If this is right

DepthAgent outperforms individual experts, fixed model fusion, and alternative selection strategies on perspective, fisheye, and panoramic benchmarks.
The largest accuracy gains occur on challenging samples where individual experts are unreliable.
The multi-reward reinforcement scheme produces decisions that balance analysis quality, selection accuracy, tool validity, and computational cost.
The method works by keeping all depth models frozen and routing each input to one or more of them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cue-driven routing could be tested on other dense prediction tasks that already have multiple specialized models, such as surface normal or semantic segmentation.
The emphasis on explicit camera-geometry analysis may transfer to selection strategies in other geometric vision problems that suffer from domain shift.
If the agent overhead remains low, the efficiency term in the reward could make the method viable for online systems that must choose models on the fly.

Load-bearing premise

The vision-language model can reliably extract camera geometry and scene cues to drive valid multi-turn expert selection and fusion decisions that improve dense depth quality.

What would settle it

Replace the agent's selections with random expert choices on a benchmark containing mixed perspective, fisheye, and panoramic images; if the resulting depth accuracy shows no drop relative to the learned policy and no loss of gains on hard samples, the adaptive mechanism adds no value.
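
The control described above can be sketched as a small harness. This is a toy under stated assumptions: `errors` holds made-up per-sample error values for two hypothetical experts, `policy_choice` stands in for the learned policy's decisions, and lower error is better. None of the numbers come from the paper.

```python
import random
from statistics import mean
from typing import Dict, List

# Made-up per-sample errors (lower is better) for two hypothetical experts.
errors: List[Dict[str, float]] = [
    {"expert_a": 0.10, "expert_b": 0.30},
    {"expert_a": 0.40, "expert_b": 0.15},
    {"expert_a": 0.20, "expert_b": 0.25},
    {"expert_a": 0.50, "expert_b": 0.12},
]

# Stand-in for the learned policy's per-sample choices (hypothetical).
policy_choice: List[str] = ["expert_a", "expert_b", "expert_a", "expert_b"]

def mean_error(choices: List[str]) -> float:
    """Mean error obtained when sample i is routed to choices[i]."""
    return mean(sample[c] for sample, c in zip(errors, choices))

def random_baseline(trials: int = 1000, seed: int = 0) -> float:
    """Average mean error of uniformly random routing over many trials."""
    rng = random.Random(seed)
    names = sorted(errors[0])
    runs = [mean_error([rng.choice(names) for _ in errors]) for _ in range(trials)]
    return mean(runs)

policy_err = mean_error(policy_choice)
random_err = random_baseline()
print(f"policy: {policy_err:.3f}  random: {random_err:.3f}")
# If policy_err were not clearly below random_err, adaptive routing would add little.
```

A real version of this check would use the benchmark's depth metrics and restrict the comparison to the hardest samples as well as the full set.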

Figures

Figures reproduced from arXiv: 2605.23281 by Girish Chandar Ganesan, Jie Zhu, Xiaoming Liu.

**Figure 1.** Motivation of DepthAgent. Real-world inputs span heterogeneous camera domains, including perspective, fisheye, and panoramic images, for which different depth experts exhibit different strengths. Instead of relying on a single model for all inputs, DepthAgent uses scene and camera cues to select suitable expert(s) on a per-sample basis, producing more reliable depth maps across diverse camera settings.

**Figure 2.** Fusion consistently outperforms single models. Left: Dataset-wise oracle proportions of single-model vs. multi-model solutions. Right: Fusion gain against the best single model. Bars and diamonds denote the mean and 90th percentile of per-sample δ1 fusion gain over the best single model; annotations show the fraction of samples where fusion achieves higher δ1.

**Figure 3.** Difficulty-dependent fusion gain. Mean fusion gain (∆δ1 ± σ) is shown across best-single δ1 quintiles within each dataset group, where Q1/Q5 denote the hardest/easiest samples. Fusion gains are largest on hard samples (where the strongest individual model performs poorly). Dashed lines indicate linear trends over quintile means, while Pearson r is computed over pooled per-sample pairs.

**Figure 4.** Overview of DepthAgent. Given an input image, DepthAgent analyzes the scene type and camera intrinsics, and interactively selects depth experts from a depth-expert tool pool. Each tool call returns a depth prediction along with auxiliary depth features. Based on the tool results, DepthAgent determines whether to continue exploration or produce the final solution, which includes the final depth map generate…

**Figure 5.** Analysis of DepthAgent behavior. From §4.3 (Qualitative Results), depth map comparison: Fig. 5a compares DepthAgent with the top-2 individual experts on representative samples. DepthAgent produces more faithful depth structures and consistently lower error maps by selecting or combining complementary expert solutions. Additional visualizations are provided in the Appendix. Solution distribution on different scenari…
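
The per-sample quantity plotted in Figures 2 and 3 can be illustrated with a short sketch. It uses the standard definition of δ1 (the fraction of pixels with max(pred/gt, gt/pred) below 1.25) and defines fusion gain as the δ1 of a fused prediction minus the δ1 of the best single model. The per-pixel mean fusion and the toy depth values are assumptions made for illustration; the paper's own fusion procedure may differ.

```python
from typing import List

def delta1(pred: List[float], gt: List[float], thresh: float = 1.25) -> float:
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh (higher is better)."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thresh)
    return hits / len(gt)

def fusion_gain(preds: List[List[float]], gt: List[float]) -> float:
    """delta1 of a per-pixel mean fusion minus delta1 of the best single model."""
    fused = [sum(px) / len(px) for px in zip(*preds)]
    best_single = max(delta1(p, gt) for p in preds)
    return delta1(fused, gt) - best_single

# Toy "hard" sample: one expert overestimates near pixels, the other underestimates.
gt = [1.0, 2.0, 4.0, 8.0]
expert_a = [1.4, 2.8, 4.2, 8.2]
expert_b = [0.7, 1.4, 3.9, 7.9]

print(delta1(expert_a, gt), delta1(expert_b, gt))
print(fusion_gain([expert_a, expert_b], gt))
```

Here both experts fail on the same pixels in opposite directions, so averaging recovers them; this is the kind of complementarity that makes the gain largest when the best single model scores poorly.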

Original abstract

Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose DepthAgent, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that DepthAgent consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DepthAgent adds a VLM agent with multi-turn tool calls and multi-reward RL for per-sample depth expert selection, but the abstract gives no direct check that cue extraction drives the gains.


The paper's core move is to treat existing depth estimators as frozen tools and train a vision-language agent to analyze inputs, call the right ones over multiple turns, and fuse outputs when needed. It reports that this beats single models, fixed fusions, and other selection baselines, with bigger lifts on hard samples across perspective, fisheye, and panoramic data. The authors also note that model preference tracks camera geometry, which motivates the adaptive approach.

Referee Report

3 major / 3 minor

Summary. The paper introduces DepthAgent, a vision-language model agent that treats existing monocular depth estimators as frozen tools. It performs multi-turn reasoning to extract camera geometry and scene cues, invoke suitable experts, and select or fuse their outputs per sample. A multi-reward reinforcement fine-tuning scheme optimizes tool execution, cue analysis, selection quality, and efficiency. Experiments on perspective, fisheye, and panoramic benchmarks claim consistent outperformance over single experts, fixed fusions, and alternative selection strategies, with largest gains on challenging samples where individual models fail.

Significance. If the central mechanism is validated, the work would demonstrate sample-wise complementarity among depth experts tied to camera geometry and show that agentic, cue-driven selection can outperform static ensembles for universal depth estimation. The planned code and model release would support reproducibility. The empirical focus on difficult samples and multi-camera settings addresses a practical deployment gap.

major comments (3)

[Abstract, §3] Abstract and §3 (motivation and method overview): the claim that 'model preference is highly correlated with camera geometry' and that the agent 'learns to analyze these cues' is load-bearing for the central contribution, yet no quantitative validation is reported (e.g., accuracy of predicted intrinsics, camera type classification, or scene descriptors against ground truth). End-to-end depth metrics alone cannot isolate whether gains arise from informed cue-driven decisions or from generic averaging/reward shaping.
[§4] §4 (experiments) and associated tables: while outperformance on challenging samples is asserted, the manuscript provides no ablation that holds the fusion mechanics fixed while varying cue quality (or vice versa), nor controls that compare against post-hoc selection without the VLM agent. This leaves the mechanism unconfirmed relative to the skeptic's concern.
[§3.2] §3.2 (multi-reward RL scheme): the multi-reward formulation includes weights on tool execution, cue analysis, selection quality, and efficiency; without an ablation on these weights or a demonstration that cue-analysis reward is necessary for the reported gains, it remains unclear whether the VLM's scene/camera reasoning is causally responsible for the improvements.

minor comments (3)

[Abstract] Abstract states 'consistent outperformance' but supplies no numerical deltas, dataset names, or error metrics; moving at least headline numbers into the abstract would improve readability.
[§3] Notation for the multi-turn tool calls and reward components could be clarified with a compact table or pseudocode listing the exact action space and reward terms.
[Figures in §4] Figure captions and axis labels in the result figures should explicitly state the evaluation metric (e.g., AbsRel, RMSE) and whether lower or higher is better.
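
For reference, the two error metrics named in the last comment have standard definitions, sketched below in plain Python. Both are lower-is-better, in contrast to δ1. The toy depth values are made up for illustration.

```python
import math
from typing import List

def abs_rel(pred: List[float], gt: List[float]) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def rmse(pred: List[float], gt: List[float]) -> float:
    """Root mean squared error: sqrt(mean((pred - gt)^2)). Lower is better."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

gt = [1.0, 2.0, 4.0]     # toy ground-truth depths in metres
pred = [1.5, 2.0, 3.0]   # toy predictions

print(f"AbsRel = {abs_rel(pred, gt):.4f}")
print(f"RMSE   = {rmse(pred, gt):.4f}")
```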

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the validation of the central claims.


Referee: [Abstract, §3] Abstract and §3 (motivation and method overview): the claim that 'model preference is highly correlated with camera geometry' and that the agent 'learns to analyze these cues' is load-bearing for the central contribution, yet no quantitative validation is reported (e.g., accuracy of predicted intrinsics, camera type classification, or scene descriptors against ground truth). End-to-end depth metrics alone cannot isolate whether gains arise from informed cue-driven decisions or from generic averaging/reward shaping.

Authors: We agree that quantitative validation of cue analysis is necessary to isolate the mechanism. In the revised manuscript we will add a dedicated evaluation subsection reporting the agent's accuracy on camera intrinsics prediction, camera-type classification, and scene descriptor extraction against ground-truth annotations on a held-out subset. These metrics will be presented alongside the depth results to demonstrate that performance gains arise from informed cue-driven decisions. revision: yes
Referee: [§4] §4 (experiments) and associated tables: while outperformance on challenging samples is asserted, the manuscript provides no ablation that holds the fusion mechanics fixed while varying cue quality (or vice versa), nor controls that compare against post-hoc selection without the VLM agent. This leaves the mechanism unconfirmed relative to the skeptic's concern.

Authors: We acknowledge the absence of these targeted controls. The revision will include two new ablations: (1) fusion mechanics held fixed while cue quality is varied (oracle cues versus agent-generated cues), and (2) direct comparison of the full VLM agent against post-hoc selection baselines that lack the agent's multi-turn reasoning. These experiments will be added to §4 and the associated tables. revision: yes
Referee: [§3.2] §3.2 (multi-reward RL scheme): the multi-reward formulation includes weights on tool execution, cue analysis, selection quality, and efficiency; without an ablation on these weights or a demonstration that cue-analysis reward is necessary for the reported gains, it remains unclear whether the VLM's scene/camera reasoning is causally responsible for the improvements.

Authors: We agree that an ablation isolating the cue-analysis reward component is required. The revised version will report results from ablating the reward weights and from removing or heavily down-weighting the cue-analysis term, demonstrating its necessity for the observed gains. These results will be added to §3.2 and the experimental section. revision: yes

Circularity Check

0 steps flagged

Empirical agent method with no self-referential derivation or fitted predictions


The paper presents an empirical system (DepthAgent) that trains a VLM-based agent via multi-reward RL to select/fuse frozen depth experts. All reported gains are measured on external benchmarks (perspective, fisheye, panoramic) against baselines; no equations, uniqueness theorems, or performance metrics are defined in terms of the method's own fitted parameters or prior self-citations. The observed complementarity is an experimental finding, not a constructed identity. This is a standard empirical ML paper whose central claims rest on held-out test metrics rather than internal redefinition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption of strong sample-wise complementarity between depth experts and on several design choices in the RL reward scheme whose values are not reported.

free parameters (1)

multi-reward weights
Weights balancing tool execution, analysis quality, depth accuracy, and efficiency are introduced in the RL fine-tuning scheme.
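
To make the role of these weights concrete, here is a minimal sketch that combines four reward terms as a weighted sum. The weighted-sum form, the field names, the default weights of 1.0, and the efficiency term are all assumptions made for illustration; the abstract only says the terms are jointly encouraged and does not state how they are combined or what values are used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper's values are not reported in this summary.
    tool: float = 1.0
    analysis: float = 1.0
    selection: float = 1.0
    efficiency: float = 1.0

def total_reward(
    tool_valid: bool,
    analysis_correct: bool,
    selection_quality: float,  # assumed to lie in [0, 1], higher is better
    num_calls: int,
    max_calls: int,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of four reward terms (an assumed combination rule)."""
    efficiency = 1.0 - num_calls / max_calls  # fewer tool calls score higher
    return (
        w.tool * float(tool_valid)
        + w.analysis * float(analysis_correct)
        + w.selection * selection_quality
        + w.efficiency * efficiency
    )

full = total_reward(True, True, 0.8, num_calls=1, max_calls=4)
no_cue = total_reward(True, True, 0.8, num_calls=1, max_calls=4,
                      w=RewardWeights(analysis=0.0))
print(full, no_cue)
```

Setting `analysis=0.0` mirrors the kind of ablation the referee asks for: the policy would then be trained without any credit for correct cue analysis.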

axioms (1)

domain assumption: Depth experts exhibit strong sample-wise complementarity correlated with camera geometry
Stated as an observation motivating the agent design.

invented entities (1)

DepthAgent (no independent evidence)
purpose: Vision-language agent for adaptive expert selection and fusion
New system introduced to perform the selection task.


Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 20 internal anchors

[1]

Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions

Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, 2023

work page 2023
[2]

Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation

Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Vasileios Gkitsas, Vladimiros Sterzentsenko, Federico Alvarez, Dimitrios Zarpalas, and Petros Daras. Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3727–3737, 2021

work page 2021
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025
[7]

Atm: Action temporality modeling for video question answering

Junwen Chen, Jie Zhu, and Yu Kong. Atm: Action temporality modeling for video question answering. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4886–4895, 2023

work page 2023
[8]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

On the suitability of reinforcement fine-tuning to visual tasks

Xiaxu Chen, Wei Li, Chunxu Liu, Chi Xie, Xiaoyan Hu, Chengqian Ma, Feng Zhu, and Rui Zhao. On the suitability of reinforcement fine-tuning to visual tasks. InCVPR, 2025

work page 2025
[10]

Multi-resolution monocular depth map fusion by self-supervised gradient-based composition

Yaqiao Dai, Renjiao Yi, Chenyang Zhu, Hongjun He, and Kai Xu. Multi-resolution monocular depth map fusion by self-supervised gradient-based composition. InProceedings of the AAAI Conference on Artificial Intelligence, 2023

work page 2023
[11]

Towards real-time monocular depth estimation for robotics: A survey.IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022

Xingshuai Dong, Matthew A Garratt, Sreenatha G Anavatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey.IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022

work page 2022
[12]

Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality

Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality. InProceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020

work page 2020
[13]

Simfir: A simple framework for fisheye image rectification with self-supervised representation learning

Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, and Houqiang Li. Simfir: A simple framework for fisheye image rectification with self-supervised representation learning. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023

work page 2023
[14]

Unidac: Universal metric depth estimation for any camera

Girish Chandar Ganesan, Yuliang Guo, Liu Ren, and Xiaoming Liu. Unidac: Universal metric depth estimation for any camera. InCVPR, 2026

work page 2026
[15]

Are we ready for autonomous driving? The KITTI vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012

work page 2012
[16]

A2d2: Audi autonomous driving dataset.arXiv preprint arXiv:2004.06320, 2020

Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2d2: Audi autonomous driving dataset.arXiv preprint arXiv:2004.06320, 2020

work page arXiv 2004
[17]

3d packing for self- supervised monocular depth estimation

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self- supervised monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[18]

Towards zero-shot scale-aware monocular depth estimation

Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rare s, Ambrus, , and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9233–9243, 2023. 10

work page 2023
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

On the holistic approach for detecting human image forgery.arXiv preprint arXiv:2601.04715, 2026

Xiao Guo, Jie Zhu, Anil Jain, and Xiaoming Liu. On the holistic approach for detecting human image forgery.arXiv preprint arXiv:2601.04715, 2026

work page arXiv 2026
[21]

Depth any camera: Zero-shot metric depth estimation from any camera

Yuliang Guo, Sparsh Garg, S Mahdi H Miangoleh, Xinyu Huang, and Liu Ren. Depth any camera: Zero-shot metric depth estimation from any camera. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025

work page 2025
[22]

A survey on deep reinforcement learning algorithms for robotic manipulation.Sensors, 2023

Dong Han, Beni Mulyana, Vladimir Stankovic, and Samuel Cheng. A survey on deep reinforcement learning algorithms for robotic manipulation.Sensors, 2023

work page 2023
[23]

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero- shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024

work page 2024
[24]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Unifuse: Unidirectional fusion for 360 panorama depth estimation.IEEE Robotics and Automation Letters (RA-L), 6(2):1519–1526, 2021

Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360 panorama depth estimation.IEEE Robotics and Automation Letters (RA-L), 6(2):1519–1526, 2021

work page 2021
[27]

Unifuse: Unidirectional fusion for 360° panorama depth estimation.IEEE Robotics Autom

Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360° panorama depth estimation.IEEE Robotics Autom. Lett., 2021

work page 2021
[28]

Reinforcement learning: A survey

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 1996

work page 1996
[29]

Repurposing diffusion-based image generators for monocular depth estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[30]

Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset.Computer Vision and Image Understanding (CVIU), 191:102877, 2020

Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset.Computer Vision and Image Understanding (CVIU), 191:102877, 2020. doi: 10.1016/j.cviu.2019.102877

work page doi:10.1016/j.cviu.2019.102877 2020
[31]

Deviant: Depth equivariant network for monocular 3d object detection

Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. Deviant: Depth equivariant network for monocular 3d object detection. InECCV, 2022

work page 2022
[32]

Charm3r: Towards unseen camera height robust monocular 3d detector

Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, and Xiaoming Liu. Charm3r: Towards unseen camera height robust monocular 3d detector. InICCV, 2025

work page 2025
[33]

Simmlm: A simple framework for multi-modal learning with missing modality

Sijie Li, Chen Chen, and Jungong Han. Simmlm: A simple framework for multi-modal learning with missing modality. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24068–24077, 2025

work page 2025
[34]

Omnifusion: 360 monocular depth estimation via geometry-aware fusion

Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. Omnifusion: 360 monocular depth estimation via geometry-aware fusion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022

work page 2022
[35]

Person recognition at altitude and range: Fusion of face, body shape and gait.arXiv preprint arXiv:2505.04616, 2025

Feng Liu, Nicholas Chimitt, Lanqing Guo, Jitesh Jain, Aditya Kane, Minchul Kim, Wes Robbins, Yiyang Su, Dingqiang Ye, Xingguang Zhang, et al. Person recognition at altitude and range: Fusion of face, body shape and gait.arXiv preprint arXiv:2505.04616, 2025

work page arXiv 2025
[36]

Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, and Qifeng Chen. Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

work page arXiv 2025
[37]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization.arXiv preprint arXiv:2601.05242, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Visual-rft: Visual reinforcement fine-tuning.ICCV, 2025

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.ICCV, 2025

work page 2025
[40]

Routing to the expert: Efficient reward-guided ensemble of large language models

Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

work page 2024
[41]

Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer.IEEE/CAA Journal of Automatica Sinica, 2022

Jiayi Ma, Linfeng Tang, Fan Fan, Jun Huang, Xiaoguang Mei, and Yong Ma. Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer.IEEE/CAA Journal of Automatica Sinica, 2022

work page 2022
[42]

Deep learning, reinforcement learning, and world models.Neural Networks, 2022

Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. Deep learning, reinforcement learning, and world models.Neural Networks, 2022

work page 2022
[43]

Indoor segmentation and support inference from rgbd images

Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. InThe European Conference on Computer Vision (ECCV), 2012

work page 2012
[44]

The fourth monocular depth estimation challenge

Anton Obukhov, Matteo Poggi, Fabio Tosi, Ripudaman Singh Arora, Jaime Spencer, Chris Russel, Simon Hadfield, Richard Bowden, Shuaihang Wang, Zhenxin Ma, et al. The fourth monocular depth estimation challenge. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

work page 2025
[45]

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.

[47] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025.

[48] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024.

[49] Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unik3d: Universal camera monocular 3d estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1028–1039, 2025.

[50] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler, 2025.

[51] Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238, 2021.

[52] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.

[53] Weisong Ren, Lijun Wang, Yongri Piao, Miao Zhang, Huchuan Lu, and Ting Liu. Adaptive co-teaching for unsupervised monocular depth estimation. In European Conference on Computer Vision, pages 89–105. Springer, 2022.

[54] Manuel Rey-Area, Mingze Yuan, and Christian Richardt. 360monodepth: High-resolution 360 monocular depth estimation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[55] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[56] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[57] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

[58] Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: Panorama transformer for indoor 360° depth estimation. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceed...

[59] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Dece...

[60] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.

[61] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

[62] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025.

[63] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.

[64] Yuanchen Wu, Junlong Du, Ke Yan, Shouhong Ding, and Xiaoqiang Li. Tove: Efficient vision-language learning via knowledge transfer from vision experts. arXiv preprint arXiv:2504.00691, 2025.

[65] Zhihao Xia, Patrick Sullivan, and Ayan Chakrabarti. Generating and exploiting probabilistic monocular depth estimates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 65–74, 2020.

[66] Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. 2024.

[67] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.

[68] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.

[69] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025.

[70] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023.

[71] Hanrong Ye and Dan Xu. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21828–21837, 2023.

[72] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.

[73] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021.

[74] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023.

[75] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[76] Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae-Eun Rhee. Egformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023.

[77] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.

[78] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025.

[79] Yu Zhang, Yu Liu, Peng Sun, Han Yan, Xiaolin Zhao, and Li Zhang. Ifcnn: A general image fusion framework based on convolutional neural network. Information Fusion, 2020.

[80] Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, and Xiaoming Liu. Unleashing the power of chain-of-prediction for monocular 3d object detection. In CVPR, 2026.

[1] Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, 2023.

[2] Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Vasileios Gkitsas, Vladimiros Sterzentsenko, Federico Alvarez, Dimitrios Zarpalas, and Petros Daras. Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3727–3737, 2021.

[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

[5] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.

[6] Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[7] Junwen Chen, Jie Zhu, and Yu Kong. Atm: Action temporality modeling for video question answering. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4886–4895, 2023.

[8] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.

[9] Xiaxu Chen, Wei Li, Chunxu Liu, Chi Xie, Xiaoyan Hu, Chengqian Ma, Feng Zhu, and Rui Zhao. On the suitability of reinforcement fine-tuning to visual tasks. In CVPR, 2025.

[10] Yaqiao Dai, Renjiao Yi, Chenyang Zhu, Hongjun He, and Kai Xu. Multi-resolution monocular depth map fusion by self-supervised gradient-based composition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

[11] Xingshuai Dong, Matthew A Garratt, Sreenatha G Anavatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022.

[12] Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020.

[13] Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, and Houqiang Li. Simfir: A simple framework for fisheye image rectification with self-supervised representation learning. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023.

[14] Girish Chandar Ganesan, Yuliang Guo, Liu Ren, and Xiaoming Liu. Unidac: Universal metric depth estimation for any camera. In CVPR, 2026.

[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[16] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2d2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320, 2020.

[17] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[18] Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rareș Ambruș, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9233–9243, 2023.

[19] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[20] Xiao Guo, Jie Zhu, Anil Jain, and Xiaoming Liu. On the holistic approach for detecting human image forgery. arXiv preprint arXiv:2601.04715, 2026.

[21] Yuliang Guo, Sparsh Garg, S Mahdi H Miangoleh, Xinyu Huang, and Liu Ren. Depth any camera: Zero-shot metric depth estimation from any camera. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025.

[22] Dong Han, Beni Mulyana, Vladimir Stankovic, and Samuel Cheng. A survey on deep reinforcement learning algorithms for robotic manipulation. Sensors, 2023.

[23] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024.

[24] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.

[25] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024.

[26] Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters (RA-L), 6(2):1519–1526, 2021.

[27] Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360° panorama depth estimation. IEEE Robotics Autom. Lett., 2021.

[28] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.

[29] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[30] Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset. Computer Vision and Image Understanding (CVIU), 191:102877, 2020. doi: 10.1016/j.cviu.2019.102877.

[31] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. Deviant: Depth equivariant network for monocular 3d object detection. In ECCV, 2022.

[32] Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, and Xiaoming Liu. Charm3r: Towards unseen camera height robust monocular 3d detector. In ICCV, 2025.

[33] Sijie Li, Chen Chen, and Jungong Han. Simmlm: A simple framework for multi-modal learning with missing modality. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24068–24077, 2025.

[34] Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. Omnifusion: 360 monocular depth estimation via geometry-aware fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022.

[35] Feng Liu, Nicholas Chimitt, Lanqing Guo, Jitesh Jain, Aditya Kane, Minchul Kim, Wes Robbins, Yiyang Su, Dingqiang Ye, Xingguang Zhang, et al. Person recognition at altitude and range: Fusion of face, body shape and gait. arXiv preprint arXiv:2505.04616, 2025.

[36] Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, and Qifeng Chen. Longvideoagent: Multi-agent reasoning with long videos. arXiv preprint arXiv:2512.20618, 2025.

[37] Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization. arXiv preprint arXiv:2601.05242, 2026.

[38] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.

[39] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. ICCV, 2025.

[40] [40]

Routing to the expert: Efficient reward-guided ensemble of large language models

Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

work page 2024

[41] [41]

Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer.IEEE/CAA Journal of Automatica Sinica, 2022

Jiayi Ma, Linfeng Tang, Fan Fan, Jun Huang, Xiaoguang Mei, and Yong Ma. Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer.IEEE/CAA Journal of Automatica Sinica, 2022

work page 2022

[42] [42]

Deep learning, reinforcement learning, and world models.Neural Networks, 2022

Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. Deep learning, reinforcement learning, and world models.Neural Networks, 2022

work page 2022

[43] [43]

Indoor segmentation and support inference from rgbd images

Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. InThe European Conference on Computer Vision (ECCV), 2012

work page 2012

[44] [44]

The fourth monocular depth estimation challenge

Anton Obukhov, Matteo Poggi, Fabio Tosi, Ripudaman Singh Arora, Jaime Spencer, Chris Russel, Simon Hadfield, Richard Bowden, Shuaihang Wang, Zhenxin Ma, et al. The fourth monocular depth estimation challenge. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

work page 2025

[45] [45]

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Is pseudo-lidar needed for monocular 3d object detection? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021

Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021

work page 2021

[47] [47]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Unidepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024

work page 2024

[49] [49]

Unik3d: Universal camera monocular 3d estimation

Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unik3d: Universal camera monocular 3d estimation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1028–1039, 2025

work page 2025

[50] [50]

Unidepthv2: Universal monocular metric depth estimation made simpler, 2025

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler, 2025

work page 2025

[51] [51]

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[52] [52]

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

work page 2020

[53] [53]

Adaptive co-teaching for unsupervised monocular depth estimation

Weisong Ren, Lijun Wang, Yongri Piao, Miao Zhang, Huchuan Lu, and Ting Liu. Adaptive co-teaching for unsupervised monocular depth estimation. InEuropean Conference on Computer Vision, pages 89–105. Springer, 2022

work page 2022

[54] [54]

360monodepth: High-resolution 360 monocular depth estimation

Manuel Rey, Mingze Yuan Area, and Christian Richardt. 360monodepth: High-resolution 360 monocular depth estimation. in 2022 ieee. InCVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[55] [55]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 12

work page 2022

[56] [56]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

Panoformer: Panorama transformer for indoor 360$ˆ{\circ }$ depth estimation

Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: Panorama transformer for indoor 360$ˆ{\circ }$ depth estimation. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceed...

[59]

Learning spherical convolution for fast features from 360° imagery

Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Dece...

[60]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023

[61]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press Cambridge, 1998

[62]

Reason-rft: Reinforcement fine-tuning for visual reasoning

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025

[63]

Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving

Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8445–8453, 2019

[64]

Tove: Efficient vision-language learning via knowledge transfer from vision experts

Yuanchen Wu, Junlong Du, Ke Yan, Shouhong Ding, and Xiaoqiang Li. Tove: Efficient vision-language learning via knowledge transfer from vision experts. arXiv preprint arXiv:2504.00691, 2025

[65]

Generating and exploiting probabilistic monocular depth estimates

Zhihao Xia, Patrick Sullivan, and Ayan Chakrabarti. Generating and exploiting probabilistic monocular depth estimates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 65–74, 2020

[66]

Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications

Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. 2024

[67]

Depth anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024

[68]

Depth anything v2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

[69]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

[70]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023

[71]

Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts

Hanrong Ye and Dan Xu. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In Proceedings of the IEEE/CVF international conference on computer vision, pages 21828–21837, 2023

[72]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.

[73]

Learning to recover 3d scene shape from a single image

Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021

[74]

Metric3d: Towards zero-shot metric 3d prediction from a single image

Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023

[75]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[76]

Egformer: Equirectangular geometry-biased transformer for 360 depth estimation

Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae-Eun Rhee. Egformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023

[77]

Taskonomy: Disentangling task transfer learning

Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722, 2018

[78]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025

[79]

Ifcnn: A general image fusion framework based on convolutional neural network

Yu Zhang, Yu Liu, Peng Sun, Han Yan, Xiaolin Zhao, and Li Zhang. Ifcnn: A general image fusion framework based on convolutional neural network. Information Fusion, 2020

[80]

Unleashing the power of chain-of-prediction for monocular 3d object detection

Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, and Xiaoming Liu. Unleashing the power of chain-of-prediction for monocular 3d object detection. In CVPR, 2026
