SCARED-C: Corrected Camera Poses for Endoscopic Depth Estimation
Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3
The pith
A pipeline using COLMAP and scale alignment fixes camera poses in the SCARED endoscopic dataset, turning 35 reliable RGB-D pairs into 17,135.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying COLMAP to re-estimate camera poses across all frames, followed by a scale-recovery step that registers the reconstructions to the ground-truth keyframe depth maps, yields metric-accurate poses for non-keyframe images and expands the set of reliable RGB-D pairs from 35 to 17,135.
What carries the argument
The two-stage pipeline of COLMAP structure-from-motion pose estimation followed by scale recovery that aligns each reconstruction to metric space using the ground-truth keyframe depth maps.
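The scale-recovery stage can be illustrated with a minimal sketch. It assumes that, for a keyframe, a depth map rendered from the SfM reconstruction (arbitrary units) is available alongside the ground-truth depth map, and that one scalar per reconstruction is enough. The median-of-ratios estimator and the name `estimate_scale` are illustrative assumptions; the page does not say which estimator the authors use.

```python
from statistics import median

def estimate_scale(sfm_depths, gt_depths, min_depth=1e-6):
    """Return s such that s * sfm_depth approximates gt_depth.

    Both inputs are flat sequences of per-pixel depths for one keyframe.
    Pixels that are missing (None) or non-positive in either map are skipped.
    """
    ratios = [
        gt / sfm
        for sfm, gt in zip(sfm_depths, gt_depths)
        if sfm is not None and gt is not None and sfm > min_depth and gt > min_depth
    ]
    if not ratios:
        raise ValueError("no valid pixel pairs to estimate scale from")
    # The median is an assumed choice: it tolerates a few outlier pixels.
    return median(ratios)

# Toy keyframe: SfM depths in arbitrary units, ground truth in metric units.
sfm = [1.0, 2.0, 0.0, 4.0, 5.0]          # third pixel has no SfM depth
gt = [50.0, 100.0, 75.0, 200.0, 900.0]   # last pixel is an outlier
scale = estimate_scale(sfm, gt)
metric_depths = [scale * d for d in sfm]
```

On the toy data the outlier ratio (180) does not move the estimate, which is why a robust statistic is a natural choice when specular highlights corrupt part of a depth map.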
If this is right
- Full video sequences become usable for supervised training of monocular and stereo depth networks.
- Benchmark results on SCARED can now reflect performance across thousands of frames rather than a few dozen.
- Pose-corrected data supports evaluation of methods that require temporally consistent geometry.
- The same correction process can be rerun whenever improved structure-from-motion tools appear.
Where Pith is reading between the lines
- The corrected dataset may reduce the domain gap when models trained on SCARED-C are deployed on real surgical video that also suffers from specular highlights.
- Future datasets could adopt the same keyframe-plus-SfM pattern to avoid relying on robot kinematics for pose labels.
- Downstream tasks such as 3D reconstruction of surgical scenes or instrument tracking could directly benefit from the denser metric depth maps.
Load-bearing premise
COLMAP can recover sufficiently accurate relative camera poses from endoscopic sequences despite low texture, specular reflections, and changing illumination.
What would settle it
A direct comparison showing that stereo disparity error or monocular depth estimation error with the corrected poses is no lower than with the original kinematic poses would falsify the claim.
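One way such a comparison could be set up is sketched below. It assumes rectified stereo, so depth follows from disparity as Z = f * B / d, and it scores each candidate depth map (from kinematic or corrected poses) by mean absolute error against the stereo-derived depth. All numbers, and the helper names `disparity_to_depth` and `mean_abs_error`, are hypothetical; the paper's actual evaluation protocol is not described on this page.

```python
def disparity_to_depth(disparity, focal_px, baseline):
    """Rectified-stereo depth Z = f * B / d; None where disparity is not positive."""
    return [focal_px * baseline / d if d > 0 else None for d in disparity]

def mean_abs_error(reference, candidate):
    """Mean |reference - candidate| over pixels that are valid in both maps."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)
             if r is not None and c is not None]
    if not diffs:
        raise ValueError("no overlapping valid pixels")
    return sum(diffs) / len(diffs)

# Hypothetical values: focal length in pixels, baseline in millimeters.
stereo_depth = disparity_to_depth([50.0, 25.0, 0.0, 20.0],
                                  focal_px=1000.0, baseline=4.0)
kinematic_depth = [90.0, 170.0, 120.0, 215.0]   # depth warped with original poses
corrected_depth = [81.0, 158.0, 120.0, 203.0]   # depth warped with corrected poses

err_kinematic = mean_abs_error(stereo_depth, kinematic_depth)
err_corrected = mean_abs_error(stereo_depth, corrected_depth)
claim_survives = err_corrected < err_kinematic
```

If `claim_survives` came out false across the dataset, that would be the falsifying outcome described above.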
Original abstract
The SCARED dataset is a widely used benchmark for endoscopic depth estimation, offering ground-truth 3D reconstructions captured with a structured light sensor. However, the depth maps for non-keyframe images rely on robot kinematics that introduce substantial pose errors, limiting the reliably labeled portion of the dataset to 35 keyframes. We present SCARED-C, a corrected version of the SCARED dataset that expands the number of reliable RGB-D pairs from 35 to 17,135. Our pipeline applies COLMAP, a Structure-from-Motion system, to re-estimate camera poses for all frames, followed by a scale recovery step that aligns the resulting reconstructions to metric space using the ground-truth keyframe depth maps. We validate the corrected poses through (1) stereo disparity evaluation and (2) monocular depth estimation experiments. The corrected dataset and code are publicly released to the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SCARED-C, a corrected version of the SCARED endoscopic dataset. It uses COLMAP Structure-from-Motion to re-estimate camera poses across all frames, followed by a global scale recovery step that aligns the reconstruction to metric space using the 35 ground-truth keyframe depth maps. This expands the set of reliable RGB-D pairs from 35 to 17,135. The corrected poses are validated via (1) stereo disparity evaluation and (2) monocular depth estimation experiments, with public release of the dataset and code.
Significance. If the corrected poses prove accurate, the work would meaningfully enlarge the usable supervised training data for endoscopic depth estimation, directly addressing the robot-kinematics pose errors that currently restrict the original SCARED benchmark to only 35 keyframes. The pipeline demonstrates a practical way to leverage off-the-shelf SfM tools for domain-specific pose correction without introducing new learned parameters.
major comments (2)
- [Abstract and §3] (pipeline description): The central claim that the scale-aligned COLMAP poses yield metric-accurate results for the additional 17,100 non-keyframe frames rests on the unquantified assumption that COLMAP recovers sufficiently accurate relative poses despite low texture, specular reflections, and varying illumination. No mean reprojection error, bundle-adjustment success rate, or fraction of retained frames is reported, so it is impossible to assess whether drift or dropped frames between keyframes undermine the expansion claim.
- [§4] (validation): The two validation experiments are described at a high level but supply no quantitative error tables, ablation results on the scale-recovery step, or failure-case analysis. Without these, the evidence that the corrected poses improve downstream depth estimation remains insufficient to support the claim of a reliably expanded benchmark.
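Two of the diagnostics requested in the first major comment can be made concrete with a short sketch. It assumes a pinhole camera without distortion and a world-to-camera pose (R, t), and computes the mean pixel distance between observed keypoints and reprojected 3D points, plus a retained-frame fraction. The function names, intrinsics, and frame counts are illustrative assumptions and are not taken from the paper or from COLMAP's output format.

```python
import math

def project(point_w, R, t, fx, fy, cx, cy):
    """Project a world point using a world-to-camera pose and pinhole intrinsics."""
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)

def mean_reprojection_error(observations, R, t, fx, fy, cx, cy):
    """Average pixel distance between observed keypoints and projected 3D points."""
    errors = []
    for point_w, (u_obs, v_obs) in observations:
        u, v = project(point_w, R, t, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
observations = [
    ((0.0, 0.0, 2.0), (320.0, 240.0)),   # projects exactly onto the observation
    ((1.0, 0.0, 2.0), (573.0, 244.0)),   # projects to (570, 240): 5 px off
]
err = mean_reprojection_error(observations, identity, [0.0, 0.0, 0.0],
                              500.0, 500.0, 320.0, 240.0)

registered_frames, total_frames = 90, 100   # hypothetical counts
retained_fraction = registered_frames / total_frames
```

Reporting both numbers per sequence would show whether low error is achieved only by dropping the hard frames.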
minor comments (1)
- [Abstract] The abstract would benefit from a single sentence stating the original SCARED limitation (robot-kinematics pose errors on non-keyframes) before describing the correction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the manuscript to provide the requested quantitative details.
- Referee: [Abstract and §3] (pipeline description): The central claim that the scale-aligned COLMAP poses yield metric-accurate results for the additional 17,100 non-keyframe frames rests on the unquantified assumption that COLMAP recovers sufficiently accurate relative poses despite low texture, specular reflections, and varying illumination. No mean reprojection error, bundle-adjustment success rate, or fraction of retained frames is reported, so it is impossible to assess whether drift or dropped frames between keyframes undermine the expansion claim.
  Authors: We agree that reporting COLMAP-specific metrics would strengthen the assessment of relative pose accuracy. In the revised manuscript, we have added to Section 3 the mean reprojection error from bundle adjustment, the bundle-adjustment success rate, and the fraction of retained frames after filtering. These metrics confirm that COLMAP produced reliable reconstructions suitable for expanding the dataset.
  Revision: yes
- Referee: [§4] (validation): The two validation experiments are described at a high level but supply no quantitative error tables, ablation results on the scale-recovery step, or failure-case analysis. Without these, the evidence that the corrected poses improve downstream depth estimation remains insufficient to support the claim of a reliably expanded benchmark.
  Authors: We acknowledge that more detailed quantitative validation is warranted. In the revised Section 4, we have included quantitative error tables for both the stereo disparity and monocular depth estimation experiments, an ablation study isolating the contribution of the scale-recovery step, and a failure-case analysis discussing sequences affected by specular reflections or illumination changes. These additions provide clearer evidence of improvement in downstream tasks.
  Revision: yes
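For reference, the kind of quantitative error table discussed in this exchange is usually built from a few standard monocular depth metrics. The sketch below computes absolute relative error, RMSE, and the δ < 1.25 accuracy on a toy example. Whether the revised manuscript reports exactly these metrics is an assumption, and `depth_metrics` is a hypothetical helper.

```python
import math

def depth_metrics(pred, gt):
    """AbsRel, RMSE, and delta < 1.25 accuracy over pixels with valid depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # A pixel counts as accurate when the larger of p/g and g/p is below 1.25.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

gt = [100.0, 200.0, 50.0, 0.0]      # last pixel has no ground truth
pred = [110.0, 180.0, 80.0, 40.0]
metrics = depth_metrics(pred, gt)
```

An ablation of the scale-recovery step could reuse the same function, once on predictions evaluated against depth built from scaled poses and once against depth built from unscaled poses.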
Circularity Check
No circularity: pipeline relies on external COLMAP and independent keyframe ground truth
The described method applies standard COLMAP SfM to recover relative poses across the sequence and then performs a global scale alignment using the 35 independent structured-light keyframe depth maps. No equations, fitted parameters, or self-citations are shown that would make the output poses equivalent to the input data by construction. The expansion to 17,135 pairs follows directly from applying an external reconstruction tool plus external metric anchors; the derivation chain remains self-contained and externally falsifiable.
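The global scale alignment described in this rationale has a simple effect on the poses, sketched here under the assumption of a world-to-camera convention x_cam = R X_world + t (the convention COLMAP uses for its stored poses). Multiplying the world by a factor s leaves rotations unchanged and multiplies translations by s, so camera centers C = -R^T t move by the same factor. The helper names and the value s = 50 are illustrative only.

```python
def scale_pose(R, t, s):
    """Scale a world-to-camera pose: rotation is unchanged, translation times s."""
    return R, [s * ti for ti in t]

def camera_center(R, t):
    """Camera center in world coordinates: C = -R^T t."""
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]

# Toy pose: 90-degree rotation about the z axis, translation in SfM units.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]

s = 50.0                                  # hypothetical recovered scale factor
R_metric, t_metric = scale_pose(R, t, s)
center_sfm = camera_center(R, t)
center_metric = camera_center(R_metric, t_metric)
```

Because a single scalar is applied, relative pose errors from SfM pass through unchanged, which is why the accuracy of the COLMAP stage remains the load-bearing assumption.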
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: COLMAP recovers sufficiently accurate relative poses from endoscopic image sequences despite low texture and specularities
- domain assumption: The original keyframe depth maps provide reliable metric ground truth for scale recovery
Table 4: Suggested train and validation split for the corrected SCARED dataset. The dataset is split approximately 70-30. The sequence code follows the format {dataset}_{keyframe}. Column headers: Train, Validation, Seq, Frames ...