arxiv: 2605.16628 · v1 · pith:IBNWD2Z3new · submitted 2026-05-15 · 💻 cs.CV

SCARED-C: Corrected Camera Poses for Endoscopic Depth Estimation

Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords endoscopic depth estimation · camera pose correction · structure from motion · SCARED dataset · RGB-D data · COLMAP · scale alignment · surgical vision

The pith

A pipeline using COLMAP and scale alignment fixes camera poses in the SCARED endoscopic dataset, turning 35 reliable RGB-D pairs into 17,135.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome pose errors in the SCARED benchmark that stem from robot kinematics on non-keyframe images. It re-estimates all camera poses via structure-from-motion, then aligns the resulting reconstructions to metric scale using the existing ground-truth depths at keyframes. A sympathetic reader cares because depth estimation models for surgery have been trained and tested on a tiny, error-prone subset of the available data. The corrected version lets researchers use full sequences without introducing large labeling artifacts.

Core claim

Applying COLMAP to re-estimate camera poses across all frames, followed by a scale-recovery step that registers the reconstructions to the ground-truth keyframe depth maps, yields metric-accurate poses for non-keyframe images and expands the set of reliable RGB-D pairs from 35 to 17,135.

What carries the argument

The two-stage pipeline of COLMAP structure-from-motion pose estimation followed by scale recovery that aligns each reconstruction to metric space using the ground-truth keyframe depth maps.
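
The scale-recovery stage can be illustrated with a minimal sketch. It assumes the scale is taken as the median ratio of ground-truth to SfM depth at a keyframe and then applied to the camera translations; the paper's actual estimator is not described here, and the function names and toy values below are hypothetical.

```python
from statistics import median


def estimate_scale(sfm_depths, gt_depths, min_depth=1e-6):
    """Median ratio of ground-truth depth to SfM depth over valid pairs.

    Assumption: a single scalar per reconstruction is enough, and the
    median is used only because it tolerates a few outlier pixels.
    """
    ratios = [
        gt / sfm
        for sfm, gt in zip(sfm_depths, gt_depths)
        if sfm > min_depth and gt > min_depth
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to estimate scale from")
    return median(ratios)


def scale_translation(translation, scale):
    """Bring an up-to-scale camera translation into metric units."""
    return tuple(scale * component for component in translation)


# Toy keyframe: SfM depths in arbitrary units, ground truth in millimetres.
# The fourth pair has no SfM depth and is skipped.
sfm_depths = [0.5, 1.0, 2.0, 0.0, 1.5]
gt_depths = [25.0, 50.0, 100.0, 40.0, 75.0]

scale = estimate_scale(sfm_depths, gt_depths)
metric_t = scale_translation((0.1, -0.2, 0.04), scale)
print(scale, metric_t)
```

Rotations are unaffected by a global scale change, so only translations and 3D points need rescaling in this simplified picture.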

If this is right

  • Full video sequences become usable for supervised training of monocular and stereo depth networks.
  • Benchmark results on SCARED can now reflect performance across thousands of frames rather than a few dozen.
  • Pose-corrected data supports evaluation of methods that require temporally consistent geometry.
  • The same correction process can be rerun whenever improved structure-from-motion tools appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The corrected dataset may reduce the domain gap when models trained on SCARED-C are deployed on real surgical video that also suffers from specular highlights.
  • Future datasets could adopt the same keyframe-plus-SfM pattern to avoid relying on robot kinematics for pose labels.
  • Downstream tasks such as 3D reconstruction of surgical scenes or instrument tracking could directly benefit from the denser metric depth maps.

Load-bearing premise

COLMAP can recover sufficiently accurate relative camera poses from endoscopic sequences despite low texture, specular reflections, and changing illumination.

What would settle it

A direct comparison showing that stereo disparity errors or monocular depth estimation accuracy on the corrected poses is no better than (or worse than) the original kinematic poses would falsify the claim.
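
Such a comparison needs both pose sources scored with identical metrics. The sketch below computes two common depth errors (absolute relative error and RMSE) for depth labels derived from each pose source; the reference values, the label lists, and the choice of metrics are all illustrative assumptions, not figures or methods taken from the paper.

```python
import math


def depth_errors(pred, ref, min_depth=1e-6):
    """Return (AbsRel, RMSE) over pixels with a valid reference depth."""
    pairs = [(p, r) for p, r in zip(pred, ref) if r > min_depth]
    if not pairs:
        raise ValueError("no valid reference depths")
    abs_rel = sum(abs(p - r) / r for p, r in pairs) / len(pairs)
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))
    return abs_rel, rmse


# Hypothetical values in millimetres; 0.0 marks a pixel without reference.
reference = [50.0, 60.0, 0.0, 80.0]
from_kinematic_poses = [55.0, 66.0, 10.0, 88.0]
from_corrected_poses = [50.5, 60.6, 10.0, 80.8]

kinematic = depth_errors(from_kinematic_poses, reference)
corrected = depth_errors(from_corrected_poses, reference)
print("kinematic AbsRel/RMSE:", kinematic)
print("corrected AbsRel/RMSE:", corrected)
```

If the corrected labels did not score lower than the kinematic ones on a check of this kind, the expansion claim would be in doubt.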

Figures

Figures reproduced from arXiv: 2605.16628 by Adam Schmidt, Jie Ying Wu, John J. Han, Max Allan, Omid Mohareri.

Figure 1: Samples from the original SCARED dataset (left) and the corrected SCARED-C dataset (right).

Original abstract

The SCARED dataset is a widely used benchmark for endoscopic depth estimation, offering ground-truth 3D reconstructions captured with a structured light sensor. However, the depth maps for non-keyframe images rely on robot kinematics that introduce substantial pose errors, limiting the reliably labeled portion of the dataset to 35 keyframes. We present SCARED-C, a corrected version of the SCARED dataset that expands the number of reliable RGB-D pairs from 35 to 17,135. Our pipeline applies COLMAP, a Structure-from-Motion system, to re-estimate camera poses for all frames, followed by a scale recovery step that aligns the resulting reconstructions to metric space using the ground-truth keyframe depth maps. We validate the corrected poses through (1) stereo disparity evaluation and (2) monocular depth estimation experiments. The corrected dataset and code are publicly released to the community.
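
One way a stereo disparity check could work, assuming a rectified stereo pair, is to convert each depth label into the disparity it implies (disparity = focal length × baseline / depth) and compare that with the disparity found by a stereo matcher. The abstract does not spell out the procedure, so the sketch below is only an illustration; the focal length, baseline, and sample values are hypothetical.

```python
def depth_to_disparity(depth_mm, focal_px, baseline_mm):
    """Disparity in pixels implied by a depth, for a rectified stereo pair."""
    if depth_mm <= 0:
        return None  # invalid or missing depth label
    return focal_px * baseline_mm / depth_mm


def mean_disparity_error(depth_labels_mm, matched_disparities_px,
                         focal_px, baseline_mm):
    """Mean absolute gap between implied and matched disparities."""
    errors = []
    for depth, matched in zip(depth_labels_mm, matched_disparities_px):
        expected = depth_to_disparity(depth, focal_px, baseline_mm)
        if expected is None:
            continue  # skip pixels without a depth label
        errors.append(abs(expected - matched))
    if not errors:
        raise ValueError("no valid depth labels")
    return sum(errors) / len(errors)


# Hypothetical camera: 1000 px focal length, 4 mm baseline.
depth_labels = [50.0, 100.0, 0.0]
matched = [80.5, 39.0, 12.0]
error_px = mean_disparity_error(depth_labels, matched, 1000.0, 4.0)
print(error_px)
```

A large gap would point to a wrong pose (and therefore a wrong reprojected depth map) rather than to the stereo matcher, provided the matcher itself is reliable on the scene.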

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SCARED-C, a corrected version of the SCARED endoscopic dataset. It uses COLMAP Structure-from-Motion to re-estimate camera poses across all frames, followed by a global scale recovery step that aligns the reconstruction to metric space using the 35 ground-truth keyframe depth maps. This expands the set of reliable RGB-D pairs from 35 to 17,135. The corrected poses are validated via (1) stereo disparity evaluation and (2) monocular depth estimation experiments, with public release of the dataset and code.

Significance. If the corrected poses prove accurate, the work would meaningfully enlarge the usable supervised training data for endoscopic depth estimation, directly addressing the robot-kinematics pose errors that currently restrict the original SCARED benchmark to only 35 keyframes. The pipeline demonstrates a practical way to leverage off-the-shelf SfM tools for domain-specific pose correction without introducing new learned parameters.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (pipeline description): the central claim that the scale-aligned COLMAP poses yield metric-accurate results for the additional 17,100 non-keyframe frames rests on the unquantified assumption that COLMAP recovers sufficiently accurate relative poses despite low texture, specular reflections, and varying illumination. No mean reprojection error, bundle-adjustment success rate, or fraction of retained frames is reported, so it is impossible to assess whether drift or dropped frames between keyframes undermine the expansion claim.
  2. [§4] §4 (validation): the two validation experiments are described at a high level but supply no quantitative error tables, ablation results on the scale-recovery step, or failure-case analysis. Without these, the evidence that the corrected poses improve downstream depth estimation remains insufficient to support the claim of a reliably expanded benchmark.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single sentence stating the original SCARED limitation (robot-kinematics pose errors on non-keyframes) before describing the correction.
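
The quantities requested in the first major comment can be computed from any SfM output. The sketch below is a minimal, library-free illustration that assumes a pinhole camera without distortion and a world-to-camera pose convention (X_c = R·X_w + t); the intrinsics, observations, and frame counts are hypothetical and are not taken from the paper.

```python
import math


def project(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point into the image with a pinhole model."""
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
          for i in range(3)]
    if xc[2] <= 0:
        return None  # point lies behind the camera
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def mean_reprojection_error(observations, R, t, fx, fy, cx, cy):
    """Mean pixel distance between projected points and observed keypoints."""
    errors = []
    for point_world, (u, v) in observations:
        uv = project(point_world, R, t, fx, fy, cx, cy)
        if uv is None:
            continue
        errors.append(math.hypot(uv[0] - u, uv[1] - v))
    if not errors:
        raise ValueError("no observations in front of the camera")
    return sum(errors) / len(errors)


def retained_fraction(num_registered, num_total):
    """Share of input frames that received a pose."""
    return num_registered / num_total


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
observations = [
    ((0.0, 0.0, 50.0), (640.0, 512.0)),  # projects exactly onto the keypoint
    ((5.0, 0.0, 50.0), (743.0, 516.0)),  # keypoint is off by (3, 4) pixels
]
err = mean_reprojection_error(observations, identity, [0.0, 0.0, 0.0],
                              1000.0, 1000.0, 640.0, 512.0)
frac = retained_fraction(90, 100)  # hypothetical counts
print(err, frac)
```

Reporting both numbers per sequence would show whether drift or dropped frames cluster between keyframes.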

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the manuscript to provide the requested quantitative details.

  1. Referee: [Abstract and §3] Abstract and §3 (pipeline description): the central claim that the scale-aligned COLMAP poses yield metric-accurate results for the additional 17,100 non-keyframe frames rests on the unquantified assumption that COLMAP recovers sufficiently accurate relative poses despite low texture, specular reflections, and varying illumination. No mean reprojection error, bundle-adjustment success rate, or fraction of retained frames is reported, so it is impossible to assess whether drift or dropped frames between keyframes undermine the expansion claim.

    Authors: We agree that reporting COLMAP-specific metrics would strengthen the assessment of relative pose accuracy. In the revised manuscript, we have added to Section 3 the mean reprojection error from bundle adjustment, the bundle-adjustment success rate, and the fraction of retained frames after filtering. These metrics confirm that COLMAP produced reliable reconstructions suitable for expanding the dataset.
    Revision: yes

  2. Referee: [§4] §4 (validation): the two validation experiments are described at a high level but supply no quantitative error tables, ablation results on the scale-recovery step, or failure-case analysis. Without these, the evidence that the corrected poses improve downstream depth estimation remains insufficient to support the claim of a reliably expanded benchmark.

    Authors: We acknowledge that more detailed quantitative validation is warranted. In the revised Section 4, we have included quantitative error tables for both the stereo disparity and monocular depth estimation experiments, an ablation study isolating the contribution of the scale-recovery step, and a failure-case analysis discussing sequences affected by specular reflections or illumination changes. These additions provide clearer evidence of improvement in downstream tasks.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline relies on external COLMAP and independent keyframe ground truth

The described method applies standard COLMAP SfM to recover relative poses across the sequence and then performs a global scale alignment using the 35 independent structured-light keyframe depth maps. No equations, fitted parameters, or self-citations are shown that would make the output poses equivalent to the input data by construction. The expansion to 17,135 pairs follows directly from applying an external reconstruction tool plus external metric anchors; the derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The pipeline rests on standard assumptions of structure-from-motion in challenging visual conditions and on the accuracy of the original keyframe depth maps; no new free parameters or invented entities are introduced beyond the choice of COLMAP and the scale-alignment procedure.

axioms (2)
  • Domain assumption: COLMAP recovers sufficiently accurate relative poses from endoscopic image sequences despite low texture and specularities.
    Invoked when the pipeline applies COLMAP to re-estimate poses for all frames.
  • Domain assumption: The original keyframe depth maps provide reliable metric ground truth for scale recovery.
    Used in the scale-alignment step that maps reconstructions to metric space.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    Stereo correspondence and reconstruction of endoscopic data challenge

    Max Allan, Jonathan Mcleod, Congcong Wang, Jean Claude Rosenthal, Zhenglei Hu, Niklas Gard, Peter Eisert, Ke Xue Fu, Trevor Zeffiro, Wenyao Xia, et al. Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133, 2021

  2. [2]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  3. [3]

    Depth anything in medical images: A comparative study

    John J Han, Ayberk Acar, Callahan Henry, and Jie Ying Wu. Depth anything in medical images: A comparative study. In Medical Imaging 2026: Image-Guided Procedures, Robotic Interventions, and Modeling, volume 13927, pages 58–66. SPIE, 2026

  4. [4]

    From monocular vision to autonomous action: Guiding tumor resection via 3d reconstruction

    Ayberk Acar, Mariana Smith, Lidia Al-Zogbi, Tanner Watts, Fangjie Li, Hao Li, Nural Yilmaz, Paul Maria Scheikl, Jesse F d’Almeida, Susheela Sharma, et al. From monocular vision to autonomous action: Guiding tumor resection via 3d reconstruction. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 21714–21720. IEEE, 2025

  5. [5]

    Endoslam dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos

    Kutsev Bengisu Ozyoruk, Guliz Irem Gokceler, Taylor L Bobrow, Gulfize Coskun, Kagan Incetan, Yasin Almalioglu, Faisal Mahmood, Eva Curto, Luis Perdigoto, Marina Oliveira, et al. Endoslam dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Medical image analysis, 71:102058, 2021

  6. [6]

    Realsyncol: a high-fidelity synthetic colon dataset for 3d reconstruction applications

    Chiara Lena, Davide Milesi, Alessandro Casella, Luca Carlini, Joseph C Norton, James Martin, Bruno Scaglioni, Keith L Obstein, Roberto De Sire, Marco Spadaccini, et al. Realsyncol: a high-fidelity synthetic colon dataset for 3d reconstruction applications. arXiv preprint arXiv:2602.08397, 2026

  7. [7]

    Colonoscopy 3d video dataset with paired depth from 2d-3d registration

    Taylor L Bobrow, Mayank Golhar, Rohan Vijayan, Venkata S Akshintala, Juan R Garcia, and Nicholas J Durr. Colonoscopy 3d video dataset with paired depth from 2d-3d registration. Medical image analysis, 90:102956, 2023

  8. [8]

    C3vdv2–colonoscopy 3d video dataset with enhanced realism

    Mayank V Golhar, Lucas Sebastian Galeano Fretes, Loren Ayers, Venkata S Akshintala, Taylor L Bobrow, and Nicholas J Durr. C3vdv2–colonoscopy 3d video dataset with enhanced realism. arXiv preprint arXiv:2506.24074, 2025

  9. [9]

    Simuscope: Realistic endoscopic synthetic dataset generation through surgical simulation and diffusion models

    Sabina Martyniak, Joanna Kaleta, Diego Dall’Alba, Michał Naskręt, Szymon Płotka, and Przemysław Korzeniowski. Simuscope: Realistic endoscopic synthetic dataset generation through surgical simulation and diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4268–4278. IEEE, 2025

  10. [10]

    Endopbr: Photorealistic synthetic data for surgical 3d vision via physically-based rendering

    John J Han and Jie Ying Wu. Endopbr: Photorealistic synthetic data for surgical 3d vision via physically-based rendering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5601–5611, 2026

  11. [11]

    Foundationstereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5249–5260, 2025

  12. [12]

    Fast-foundationstereo: Real-time zero-shot stereo matching

    Bowen Wen, Shaurya Dewan, and Stan Birchfield. Fast-foundationstereo: Real-time zero-shot stereo matching. arXiv preprint arXiv:2512.11130, 2025

Table 4: Suggested train and validation split for the corrected SCARED dataset. The dataset is split approximately 70-30. The sequence code follows the format {dataset}_{keyframe}.
Train Validation Seq Frames ...