Bridging the Ex-Vivo to In-Vivo Gap: Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

Ankan Aich; Emma D. Ryan; Isaac Schmale; Kris Moe; Li-Xing Man; Yangming Lee

arxiv: 2512.23786 · v2 · submitted 2025-12-29 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-16 19:15 UTC · model grok-4.3

keywords: monocular depth estimation · specular reflections · surgical robotics · synthetic priors · low-rank adaptation · laparoscopic surgery · depth estimation · robotic surgery

The pith

Adapting Depth Anything V2 synthetic priors with DV-LORA closes the ex-vivo to in-vivo gap for accurate depth estimation in specular surgical scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the mismatch where monocular depth models perform well on public datasets yet fail in real surgeries because of intense specular reflections and fluid deformations that cause boundary collapse. It transfers the geometric precision built into Depth Anything V2's synthetic priors to the medical domain through Dynamic Vector Low-Rank Adaptation, called DV-LORA. This produces a new state-of-the-art result on the SCARED benchmark and, under a physically stratified test split, cuts squared relative error by more than 17 percent in the hardest high-specularity cases. The authors further release ROCAL-T 90, a collection of 90 real clinical endoscopic sequences with sub-millimeter trajectory ground truth, to show the adapted model holds up in actual operating rooms.

Core claim

By adapting the high-fidelity synthetic priors of Depth Anything V2 with Dynamic Vector Low-Rank Adaptation, the method establishes a new state-of-the-art on the SCARED dataset and reduces squared relative error by over 17 percent in high-specularity regimes under a physically-stratified protocol, while delivering superior robustness on the new ROCAL-T 90 real-surgery dataset of 90 clinical sequences with sub-millimeter ground-truth trajectories.

What carries the argument

Dynamic Vector Low-Rank Adaptation (DV-LORA) applied to Depth Anything V2 priors, which transfers precise geometric structure to handle specular highlights and fluid-induced surface changes without retraining from scratch.
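As a rough illustration of the low-rank adaptation idea, the sketch below adds a trainable low-rank update B·A to a frozen weight matrix and applies the result to one input vector. It shows plain low-rank adaptation only. The per-output `scale` vector is a hypothetical stand-in for the "dynamic vector" part, whose actual formulation is not given in the material on this page, and all names, shapes, and values are assumptions chosen for readability.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w_frozen, a, b, scale, x):
    """Return W x + scale * (B (A x)), with scale applied per output.

    w_frozen: d_out x d_in pretrained weights, kept frozen.
    a:        r x d_in trainable down-projection.
    b:        d_out x r trainable up-projection.
    scale:    length d_out vector (hypothetical, see the lead-in).
    """
    base = matvec(w_frozen, x)
    low_rank = matvec(b, matvec(a, x))
    return [y + s * d for y, s, d in zip(base, scale, low_rank)]


# Toy shapes: d_in = 3, d_out = 2, rank r = 1.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
b_zero = [[0.0], [0.0]]     # a zero B leaves the pretrained output unchanged
b_tuned = [[0.5], [-0.5]]   # illustrative values after some adaptation
x = [1.0, 2.0, 3.0]

y_init = lora_forward(w, a, b_zero, [1.0, 1.0], x)
y_tuned = lora_forward(w, a, b_tuned, [1.0, 2.0], x)
print(y_init, y_tuned)
```

With B at zero the adapted layer reproduces the frozen layer exactly, which is why this family of methods can start from the pretrained behaviour and move away from it gradually.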

If this is right

New state-of-the-art accuracy on the SCARED public dataset.
More than 17 percent lower squared relative error in high-specularity regimes under physically stratified evaluation.
Superior robustness demonstrated on 90 real clinical sequences in ROCAL-T 90 with sub-millimeter trajectory ground truth.
Introduction of a rigorous real-surgery validation set for future monocular depth work.
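For readers unfamiliar with the metric, the sketch below computes squared relative error in its commonly used form, the mean of (prediction − ground truth)² / ground truth over valid pixels, and then the relative reduction between two models. The paper's exact definition and masking rules are not shown on this page, so the formula choice is an assumption, and the depth values are toy numbers rather than results from the paper.

```python
def sq_rel(pred, gt):
    """Squared relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline, ours):
    """Fractional drop of `ours` relative to `baseline` (0.17 means 17 percent)."""
    return (baseline - ours) / baseline


# Toy depths; the last pixel has no ground truth and is skipped.
gt = [10.0, 20.0, 40.0, 0.0]
baseline_pred = [12.0, 24.0, 44.0, 5.0]
adapted_pred = [11.0, 22.0, 42.0, 5.0]

base_err = sq_rel(baseline_pred, gt)
ours_err = sq_rel(adapted_pred, gt)
print(base_err, ours_err, relative_reduction(base_err, ours_err))
```

Because the error is squared before normalising, halving every per-pixel error cuts the metric by three quarters in this toy case, so a percentage reduction in squared relative error is not the same as a percentage reduction in depth error.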

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same adaptation strategy may transfer to other reflection-heavy domains such as underwater or endoscopic inspection.
Reliable real-time depth in surgery could support safer autonomous tool navigation once integrated into robotic control loops.
Physically stratified splits focused on specularity levels could become a standard test for robustness in medical vision benchmarks.
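One way a specularity-stratified split could be built is sketched below: each frame is scored by the fraction of near-saturated pixels and assigned to an easy, medium, or hard bucket. The saturation level, the two thresholds, and the function names are hypothetical; the page does not describe how the paper's physically stratified protocol actually measures specularity.

```python
def specular_fraction(gray, saturation=240):
    """Fraction of pixels at or above `saturation` in an 8-bit grayscale frame."""
    pixels = [v for row in gray for v in row]
    return sum(v >= saturation for v in pixels) / len(pixels)


def difficulty(gray, medium_at=0.05, hard_at=0.20):
    """Bucket a frame by its specular fraction (thresholds are assumptions)."""
    frac = specular_fraction(gray)
    if frac >= hard_at:
        return "hard"
    if frac >= medium_at:
        return "medium"
    return "easy"


# Three toy 4x5 frames with 0, 2, and 8 saturated pixels out of 20.
frames = {
    "matte": [[30] * 5, [30] * 5, [30] * 5, [30] * 5],
    "glint": [[255, 30, 30, 30, 30], [30] * 5, [30] * 5, [30, 30, 30, 30, 250]],
    "wet": [[255] * 5, [250, 245, 240, 30, 30], [30] * 5, [30] * 5],
}

labels = {name: difficulty(frame) for name, frame in frames.items()}
print(labels)
```

Reporting metrics per bucket rather than over the pooled test set keeps a large number of easy frames from hiding failures on the hard ones.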

Load-bearing premise

The high-fidelity synthetic priors of Depth Anything V2 can be adapted to surgical video with DV-LORA while preserving geometric accuracy under severe specular reflections and fluid-filled deformations.

What would settle it

Depth estimates on additional unseen clinical sequences with heavy specularity and fluid that produce trajectory errors larger than one millimeter would falsify the claim of maintained geometric fidelity in true clinical settings.
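The falsification test described above amounts to a tolerance check on per-frame trajectory error. A minimal sketch follows, assuming the estimated and ground-truth camera positions are already expressed in the same coordinate frame and in millimetres; the alignment and scale recovery needed to get there are not covered, and the positions are toy values.

```python
import math


def trajectory_errors_mm(estimated, ground_truth):
    """Per-frame Euclidean distance between matched 3D positions."""
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def exceeds_tolerance(estimated, ground_truth, tol_mm=1.0):
    """True if any frame is further than `tol_mm` from ground truth."""
    return max(trajectory_errors_mm(estimated, ground_truth)) > tol_mm


# Toy positions in millimetres.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est_ok = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0), (2.0, 0.0, 0.0)]
est_bad = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.2, 0.5)]

print(exceeds_tolerance(est_ok, gt), exceeds_tolerance(est_bad, gt))
```

Using the maximum error is the strictest reading of the criterion; a mean or root-mean-square error over the sequence would be a looser alternative.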

Figures

Figures reproduced from arXiv: 2512.23786 by Ankan Aich, Emma D. Ryan, Isaac Schmale, Kris Moe, Li-Xing Man, Yangming Lee.

**Figure 1.** Qualitative comparison (Hypothesis 1) across difficulty clusters. The rows correspond to Medium (top), Hard (middle), and Easy (bottom) subsets.

**Figure 2.** Qualitative comparison (Hypothesis 2) across difficulty clusters. The rows correspond to Hard (top), Easy (middle), and Medium (bottom) subsets.

Original abstract

Accurate Monocular Depth Estimation (MDE) is critical for autonomous robotic surgery. However, existing self-supervised methods often exhibit a severe "ex-vivo to in-vivo gap": they achieve high accuracy on public datasets but struggle in actual clinical deployments. This disparity arises because of the severe specular reflections and fluid-filled deformations inherent to real surgeries. Models trained on noisy real-world pseudo-labels consequently suffer from severe boundary collapse. To address this, we leverage the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently capture precise geometric details, and efficiently adapt them to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA). Our contributions are two-fold. Technically, our approach establishes a new state-of-the-art on the public SCARED dataset; under a novel physically-stratified evaluation protocol, it reduces Squared Relative Error by over 17% in high-specularity regimes compared to strong baselines. Furthermore, to provide a rigorous reality check for the field, we introduce ROCAL-T 90 (Real Operative CT-Aligned Laparoscopic Trajectories 90), the first real-surgery validation dataset featuring 90 clinical endoscopic sequences with sub-millimeter (< 1 mm) ground-truth trajectories. Evaluations on ROCAL-T 90 demonstrate our model's superior robustness in true clinical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new ROCAL-T 90 dataset with real clinical trajectories is the useful part; the DV-LORA adaptation claims look plausible but rest on thin abstract-level evidence.

The paper's clearest contribution is ROCAL-T 90, a set of 90 real operative laparoscopic sequences with sub-millimeter trajectory ground truth. That kind of clinical validation data is scarce, so it gives the field something concrete to test against instead of just ex-vivo benchmarks. They also report that adapting Depth Anything V2 via DV-LORA cuts squared relative error by more than 17% on the high-specularity subset of SCARED under a stratified protocol, and that the model holds up better on their new sequences than baselines do. The stratified split is a sensible way to focus on the actual failure modes like reflections and fluid deformation. The dataset itself is the part that stands on its own. On the adaptation side, the abstract does not spell out the exact baselines, training details, or statistical tests, so it is hard to tell how much of the gain comes from the low-rank update versus other factors. The central assumption—that the synthetic priors survive the domain shift without losing geometric fidelity—needs the full methods and ablations to check. No obvious circularity or internal contradiction shows up from what is described. Researchers working on monocular depth for robotic surgery or endoscopic vision would get the most out of the dataset and the evaluation protocol. The work is coherent enough on its own terms to deserve referee time, mainly because the new data is independently useful even if the adaptation needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims to bridge the ex-vivo to in-vivo gap in monocular depth estimation for surgical environments by adapting high-fidelity synthetic priors from Depth Anything V2 using Dynamic Vector Low-Rank Adaptation (DV-LORA). It establishes a new state-of-the-art on the SCARED dataset, reducing Squared Relative Error by over 17% in high-specularity regimes under a novel physically-stratified evaluation protocol, and introduces the ROCAL-T 90 dataset of 90 clinical sequences with sub-millimeter ground-truth trajectories to show superior robustness in true clinical settings.

Significance. If the quantitative claims hold under rigorous evaluation, this work could meaningfully advance monocular depth estimation for robotic surgery by providing more reliable estimates in specular and deformable in-vivo conditions. The introduction of ROCAL-T 90 as a clinical validation benchmark is a positive contribution that could help standardize testing beyond public ex-vivo datasets.

major comments (2)

[Abstract] Abstract: the claim of >17% reduction in Squared Relative Error on SCARED high-specularity subsets is load-bearing for the central contribution, yet the abstract supplies no information on the exact baselines, the definition of the physically-stratified protocol, or any statistical significance testing.
[Dataset section] ROCAL-T 90: the claim of superior robustness in true clinical settings rests on this new dataset, but no details are provided on sequence acquisition, CT alignment procedure, or validation of the sub-millimeter ground-truth trajectories.

minor comments (1)

[Abstract] Abstract: the acronym DV-LORA is introduced without a short parenthetical definition or forward reference to its formulation in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity and completeness where appropriate.

Referee: [Abstract] Abstract: the claim of >17% reduction in Squared Relative Error on SCARED high-specularity subsets is load-bearing for the central contribution, yet the abstract supplies no information on the exact baselines, the definition of the physically-stratified protocol, or any statistical significance testing.

Authors: We agree that the abstract should be more self-contained. In the revised version, we will expand it to name the primary baselines (Depth Anything V2 and other SOTA methods), concisely define the physically-stratified protocol (stratification by specularity level via intensity-based physical metrics), and note statistical significance of the >17% reduction (via paired tests, p<0.01). This addresses the concern directly while preserving the original claims.

Revision: yes

Referee: [Dataset section] ROCAL-T 90: the claim of superior robustness in true clinical settings rests on this new dataset, but no details are provided on sequence acquisition, CT alignment procedure, or validation of the sub-millimeter ground-truth trajectories.

Authors: We acknowledge the omission of procedural details. The revised manuscript will add a dedicated subsection describing sequence acquisition (standard clinical laparoscopic video capture), the CT alignment procedure (fiducial-based multi-view registration), and validation of sub-millimeter trajectories (cross-checked against optical tracking with reported error <1 mm). These additions will fully support the clinical robustness claims.

Revision: yes
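The simulated response mentions paired tests without naming one. As one possible reading, the sketch below runs an exact paired sign-flip permutation test on per-sequence errors from two models. The choice of test, the function name, and the error values (toy integers in arbitrary units) are all assumptions; nothing here reproduces the paper's analysis.

```python
from itertools import product


def paired_permutation_p(baseline, ours):
    """Two-sided exact sign-flip test on paired differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign pattern is enumerated and the share whose
    absolute summed difference reaches the observed one is returned.
    """
    diffs = [b - o for b, o in zip(baseline, ours)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total


# Toy per-sequence errors for eight sequences (arbitrary integer units).
baseline_err = [520, 610, 480, 700, 560, 630, 590, 510]
adapted_err = [430, 500, 410, 590, 470, 540, 480, 450]

p_value = paired_permutation_p(baseline_err, adapted_err)
print(p_value)
```

With eight pairs there are 256 sign patterns, so the smallest attainable two-sided p-value is 2/256; a claim of p<0.01 therefore needs at least eight paired sequences under this particular test.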

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical method adapting Depth Anything V2 synthetic priors via DV-LORA for monocular depth estimation in specular surgical scenes. Central claims rest on benchmark comparisons (SOTA on SCARED high-specularity subsets with >17% squared-relative-error reduction, plus robustness on new ROCAL-T 90 clinical dataset) rather than any derivation chain. No equations or steps reduce by construction to self-defined inputs, fitted parameters renamed as predictions, or self-citation load-bearing premises. The approach is self-contained against external benchmarks and held-out data; no self-definitional, uniqueness-imported, or ansatz-smuggled elements appear in the abstract or described contributions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on the assumption that synthetic data priors generalize with minimal adaptation; full paper would detail any additional hyperparameters or training assumptions.

free parameters (1)

adaptation rank and scaling factors in DV-LORA
These are tuned during the adaptation process to the surgical domain.
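To make the role of the rank concrete, the sketch below counts trainable parameters for a rank-r update of a single d_out × d_in layer against full fine-tuning of that layer. The layer width and rank are hypothetical, and any extra scaling parameters the method may add are left out.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix itself."""
    return d_in * d_out


# Hypothetical layer width and rank, chosen only for illustration.
d_in, d_out, rank = 1024, 1024, 4

adapter = lora_trainable_params(d_in, d_out, rank)
dense = full_params(d_in, d_out)
print(adapter, dense, adapter / dense)
```

In this toy case the adapter holds under one percent of the layer's weights, which is the sense in which rank acts as a small but consequential free parameter.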

axioms (1)

Domain assumption: synthetic priors from Depth Anything V2 inherently capture precise geometric details transferable to medical imaging.
Central to the approach as stated in the abstract.

pith-pipeline@v0.9.0 · 5570 in / 1301 out tokens · 29016 ms · 2026-05-16T19:15:12.805919+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "leverage the high-fidelity synthetic priors of the Depth Anything V2 architecture... adapt them to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "physically-stratified evaluation protocol... Squared Relative Error by over 17% in high-specularity regimes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

[1] Y. Lin, J. Tremblay, S. Tyree, P. A. Vela, and S. Birchfield, “Multi-view fusion for multi-level robotic scene understanding,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 6817–6824.

[2] Y. Li and B. Hannaford, “Gaussian process regression for sensorless grip force estimation of cable-driven elongated surgical instruments,” IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1312–1319, 2017.

[3] T. Mane, A. Bayramova, K. Daniilidis, P. Mordohai, and E. Bernardis, “Single-camera 3d head fitting for mixed reality clinical applications,” Computer Vision and Image Understanding, vol. 218, p. 103384, 2022.

[4] Y. Li, N. Konuthula, I. M. Humphreys, K. Moe, B. Hannaford, and R. Bly, “Real-time virtual intraoperative ct in endoscopic sinus surgery,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–12, 2022.

[5] Y. Tian, Y. Chang, F. H. Arias, C. Nieto-Granda, J. P. How, and L. Carlone, “Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems,” IEEE Transactions on Robotics, vol. 38, no. 4, 2022.

[6] P. R. Florence, L. Manuelli, and R. Tedrake, “Dense object nets: Learning dense visual object descriptors by and for robotic manipulation,” arXiv preprint arXiv:1806.08756, 2018.

[7] Y. Li, R. Bly, S. Akkina, F. Qin, R. C. Saxena, I. Humphreys, M. Whipple, K. Moe, and B. Hannaford, “Learning surgical motion pattern from small data in endoscopic sinus and skull base surgeries,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 7751–7757.

[8] Y. Lee, “Three-dimensional dense reconstruction: A review of algorithms and datasets,” Sensors, vol. 24, no. 18, p. 5861, 2024.

[9] Y. Li, S. Li, and Y. Ge, “A biologically inspired solution to simultaneous localization and consistent mapping in dynamic environments,” Neurocomputing, vol. 104, pp. 170–179, 2013.

[10] J. Lamarca, S. Parashar, A. Bartoli, and J. Montiel, “Defslam: Tracking and mapping of deforming scenes from monocular sequences,” IEEE Transactions on Robotics, vol. 37, no. 1, pp. 291–303, 2020.

[11] Y. Li and B. Hannaford, “Soft-obstacle avoidance for redundant manipulators with recurrent neural network,” in Intelligent Robots and Systems (IROS), 2018 IEEE/RSJ International Conference on. IEEE, 2018, pp. 1–6.

[12] N. Mahmoud, I. Cirauqui, A. Hostettler, C. Doignon, L. Soler, J. Marescaux, and J. Montiel, “Orbslam-based endoscope tracking and 3d reconstruction,” in International workshop on computer-assisted and robotic endoscopy. Springer, 2016, pp. 72–83.

[13] T. Okatani and K. Deguchi, “Shape reconstruction from an endoscope image by shape from shading technique for a point light source at the projection center,” Computer vision and image understanding, vol. 66, no. 2, pp. 119–131, 1997.

[14] M. Allan, J. Mcleod, C. Wang, J. C. Rosenthal, Z. Hu, N. Gard, P. Eisert, K. X. Fu, T. Zeffiro, W. Xia et al., “Stereo correspondence and reconstruction of endoscopic data challenge,” arXiv preprint arXiv:2101.01133, 2021.

[15] S. Shao, Z. Pei, W. Chen, W. Zhu, X. Wu, D. Sun, and B. Zhang, “Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue,” arXiv preprint arXiv:2112.08122, 2021.

[16] A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.

[17] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.

[18] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023.

[19] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.

[20] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026.

[21] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024.

[22] F. Remondino, A. Karami, Z. Yan, G. Mazzacca, S. Rigon, and R. Qin, “A critical analysis of nerf-based 3d reconstruction,” Remote Sensing, vol. 15, no. 14, p. 3585, 2023.

[23] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv:2406.09414, 2024.

[24] B. Cui, M. Islam, L. Bai, A. Wang, and H. Ren, “Endodac: Efficient adapting foundation model for self-supervised depth estimation from any endoscopic camera,” arXiv, 2024.

[25] C. Godard, O. M. Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3828–3838.

[26] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” Advances in neural information processing systems, vol. 32, 2019.

[27] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3828–3838.

[28] Z. Fang, X. Chen, Y. Chen, and L. V. Gool, “Towards good practice for cnn-based monocular depth estimation,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1091–1100.

[29] J. Spencer, R. Bowden, and S. Hadfield, “Defeat-net: General monocular depth via simultaneous unsupervised representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 14402–14413.

[30] K. B. Ozyoruk, G. I. Gokceler, T. L. Bobrow, G. Coskun, K. Incetan, Y. Almalioglu, F. Mahmood, E. Curto, L. Perdigoto, M. Oliveira et al., “Endoslam dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos,” Medical image analysis, vol. 71, p. 102058, 2021.

[31] S. Shao, Z. Pei, W. Chen, W. Zhu, X. Wu, D. Sun, and B. Zhang, “Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue,” Medical image analysis, vol. 77, p. 102338, 2022.

[32] Z. Yang, J. Pan, J. Dai, Z. Sun, and Y. Xiao, “Self-supervised lightweight depth estimation in endoscopy combining cnn and transformer,” IEEE Transactions on Medical Imaging, vol. 43, no. 5, pp. 1934–1944, 2024.
