SurfSurg6D: Geometry Consistent Dense Correspondence for Textureless Surgical Instrument Pose Estimation

Chang Han Low; Daiyun Shen; Mengya Xu; Qian Li; Qi Dou; Shuojue Yang; Yueming Jin

arxiv: 2605.25598 · v1 · pith:M2YRODGYnew · submitted 2026-05-25 · 💻 cs.CV

Daiyun Shen, Shuojue Yang, Chang Han Low, Qian Li, Mengya Xu, Qi Dou, Yueming Jin

Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords: surgical instrument pose estimation, dense correspondence, synthetic dataset, RGB-only 6D pose, textureless objects, robotic surgery, computer vision, EndoVis

The pith

SurfSurg6D uses geometry-consistent dense correspondence plus a new synthetic dataset to estimate 6D poses of textureless surgical instruments from RGB images alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the challenge of recovering the full 6D pose of surgical tools, which lack texture, suffer occlusions, and have very few real labeled examples. It first builds SynSurg6D, a synthetic dataset engineered to cover a broader range of poses than existing real collections. It then introduces SurfSurg6D, a framework that recovers pose by computing dense 2D-to-3D surface correspondences while enforcing geometric consistency. Experiments on SurgRIPE, EndoVis2018, and SurgPose show the synthetic data improves several existing methods and that SurfSurg6D itself exceeds prior RGB-only results. Accurate real-time pose from ordinary cameras would directly support robotic assistance and skill assessment in surgery.

Core claim

Constructing the synthetic dataset SynSurg6D diversifies pose distributions during training, and the SurfSurg6D dense-correspondence framework, by establishing geometry-consistent mappings from image pixels to the instrument surface model, delivers more accurate and efficient 6D pose estimates than prior methods when only RGB input is available.

What carries the argument

SurfSurg6D, the dense-correspondence framework that maps image points to 3D surface points on the instrument while preserving geometric consistency to solve for the 6D pose.
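As a rough illustration of the final step, the sketch below scores a candidate pose against a set of 2D-to-3D correspondences using mean reprojection error under a pinhole camera. It is not the paper's solver, which the text above does not describe. The function names, the intrinsics, and the sample points are all hypothetical and chosen only for illustration.

```python
import math

def project(K, R, t, X):
    """Project a 3D model point X into the image under pose (R, t).

    K is assumed to be a pinhole intrinsics tuple (fx, fy, cx, cy).
    """
    # Camera-frame coordinates: Xc = R * X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    fx, fy, cx, cy = K
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)

def mean_reprojection_error(K, R, t, correspondences):
    """Average pixel distance between observed pixels and projected surface points."""
    errors = []
    for (u, v), X in correspondences:
        pu, pv = project(K, R, t, X)
        errors.append(math.hypot(pu - u, pv - v))
    return sum(errors) / len(errors)

# Hypothetical camera and instrument surface points (metres), not from the paper.
K = (800.0, 800.0, 320.0, 240.0)
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_true = [0.0, 0.0, 0.5]
model_points = [[0.0, 0.0, 0.0], [0.01, 0.0, 0.0], [0.0, 0.02, 0.0], [0.01, 0.02, 0.005]]

# Simulate perfect dense correspondences by projecting with the true pose.
observed = [(project(K, R_identity, t_true, X), X) for X in model_points]

err_true = mean_reprojection_error(K, R_identity, t_true, observed)
err_shifted = mean_reprojection_error(K, R_identity, [0.002, 0.0, 0.5], observed)
```

A pose solver would search for the rotation and translation that minimize this error, typically with an outlier-robust scheme. Here a 2 mm lateral shift at 0.5 m depth already moves the projections by about three pixels, which gives a sense of the precision the task demands.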

If this is right

The synthetic SynSurg6D dataset raises accuracy of multiple existing pose estimators on real surgical test sets by expanding pose coverage.
SurfSurg6D produces higher-precision RGB-only 6D estimates than prior methods while remaining computationally efficient.
The approach improves robustness to textureless surfaces and partial occlusions typical in minimally invasive surgery.
RGB-only operation removes the need for depth sensors, simplifying deployment in standard operating rooms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dense-correspondence design could extend to tracking other textureless medical devices such as catheters or implants.
Wider use of procedurally generated pose-diverse synthetic data may become routine for any vision task where real annotations are scarce.
Real-time versions of this pipeline could supply live instrument state for closed-loop robotic control or automated workflow logging.
Combining the RGB pipeline with occasional depth checks might further raise reliability without requiring depth at every frame.

Load-bearing premise

The synthetic pose variations in SynSurg6D transfer to real surgical scenes without creating domain artifacts that reduce accuracy on actual data.

What would settle it

Apply SurfSurg6D and the improved baselines to a new real surgical video set containing instruments or lighting conditions absent from both the real and synthetic training data; if accuracy gains vanish, the transfer claim fails.

Figures

Figures reproduced from arXiv: 2605.25598 by Chang Han Low, Daiyun Shen, Mengya Xu, Qian Li, Qi Dou, Shuojue Yang, Yueming Jin.

**Figure 2.** Overview of the synthetic dataset generation pipeline and the SurfSurg6D framework. (a) Scene reconstruction and ...

**Figure 3.** Examples of data variation, including background ...

**Figure 5.** The results verify the generalization of SynSurg6D ...

**Figure 4.** Examples of pose estimation results of SurfSurg6D and ...

**Figure 5.** The example of keypoint projection results of Surg...

Original abstract

Surgical instrument pose estimation provides crucial information for promising applications, including autonomous robotic surgery, skill assessment, and standardization of surgical workflow. However, this task remains highly challenging due to high precision requirements, frequent occlusions, textureless instruments, scarcity of depth information and very limited annotated data. These constraints often lead to unsatisfactory performance when employing general object pose estimation approaches to surgical scenarios. To address these issues, we first construct a new dataset SynSurg6D, to alleviate the data shortage in this task. We further propose SurfSurg6D, a dense-correspondence framework tailored for surgical instrument pose estimation. Experimental results on the SurgRIPE, EndoVis2018 and SurgPose datasets demonstrate that the introduction of our generated dataset SynSurg6D is able to diversify the pose distributions, thus enhancing the performance of existing approaches. Furthermore, SurfSurg6D outperforms existing methods, providing a robust solution for precise and efficient RGB-only pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SurfSurg6D adds a synthetic dataset and dense correspondence pipeline aimed at textureless surgical tools, but the abstract leaves the quantitative gains and synthetic-to-real transfer unshown.

The main contribution here is SynSurg6D, a new synthetic dataset meant to diversify pose distributions for surgical instruments, paired with SurfSurg6D, which uses geometry-consistent dense correspondence for RGB pose estimation.

This targets a genuine constraint in robotic surgery: textureless objects, occlusions, and scarce annotations make off-the-shelf 6D methods unreliable. Generating targeted synthetic data and testing on real sets like SurgRIPE, EndoVis2018, and SurgPose is a practical move, and the claim that the new data helps existing approaches is the sort of incremental step that can matter in applied domains.

The soft spot is the lack of visible support for the central claims. No numbers, ablations, or domain-gap checks appear in the abstract, so it is impossible to judge whether the reported outperformance comes from better pose coverage or from volume, bias, or other factors. The stress-test note on unverified synthetic-to-real transfer is on point given what is shown; without distribution overlap metrics or matched-volume controls, the diversification story stays provisional.

This is for people working on vision for surgical robotics or constrained 6D pose problems. A reader already in that niche could extract the dataset release and method details if they are released with the paper.

It should go to peer review because the problem is real and the approach is scoped to it, even though the current write-up will need the missing evidence to hold up.

Referee Report

2 major / 1 minor

Summary. The paper introduces the SynSurg6D synthetic dataset to address data scarcity and limited pose diversity in surgical instrument pose estimation, and proposes SurfSurg6D, a dense correspondence framework that enforces geometry consistency for RGB-only 6D pose estimation of textureless instruments. Experiments on SurgRIPE, EndoVis2018, and SurgPose are said to show that adding SynSurg6D improves existing methods via pose diversification and that SurfSurg6D outperforms prior approaches.

Significance. If the synthetic-to-real transfer claims hold with proper controls, the work could meaningfully advance data-efficient pose estimation for robotic surgery by providing both a new dataset and a tailored correspondence method; the emphasis on geometry consistency for textureless objects is a relevant technical direction.

major comments (2)

[Experimental results / abstract] The central claim that SynSurg6D diversifies pose distributions and thereby improves real-data performance (stated in the abstract and presumably in the experimental section) rests on an unverified assumption about synthetic-to-real transfer. No distribution-overlap metrics, domain-gap quantification (e.g., feature-space distances or appearance statistics), or ablation comparing matched-volume real augmentations versus SynSurg6D are referenced, leaving open the possibility that observed gains arise from data volume, rendering bias, or other confounders rather than diversification.
[Abstract / Experiments] The abstract asserts outperformance of SurfSurg6D and benefit from SynSurg6D on three datasets but supplies no quantitative metrics, error breakdowns, or ablation studies. Without these in the main text, it is impossible to assess whether the reported gains are statistically meaningful or load-bearing for the method's contribution.

minor comments (1)

[Method] Notation for the dense correspondence and geometry-consistency losses should be introduced with explicit equations and variable definitions in the method section to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Experimental results / abstract] The central claim that SynSurg6D diversifies pose distributions and thereby improves real-data performance (stated in the abstract and presumably in the experimental section) rests on an unverified assumption about synthetic-to-real transfer. No distribution-overlap metrics, domain-gap quantification (e.g., feature-space distances or appearance statistics), or ablation comparing matched-volume real augmentations versus SynSurg6D are referenced, leaving open the possibility that observed gains arise from data volume, rendering bias, or other confounders rather than diversification.

Authors: We agree that explicit verification of pose diversification and controls for domain gap would strengthen the central claim. The current manuscript demonstrates performance gains on the three real datasets after incorporating SynSurg6D but does not report distribution-overlap metrics or domain-gap quantifications. In revision we will add (i) quantitative pose-distribution statistics (means, variances, and Wasserstein distances on rotation and translation parameters) comparing the original training sets to the augmented sets, (ii) t-SNE visualizations of image features from real and synthetic data to illustrate domain alignment, and (iii) an ablation that trains baseline methods with additional real-data augmentations of matched volume where such data exist. These additions will help isolate the contribution of pose diversification from volume or rendering effects. revision: yes
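The pose-distribution statistics proposed in this response could be computed along the following lines. This is a minimal sketch under stated assumptions: rotations are summarized by their geodesic angle from a reference orientation, the two samples have equal size, and the depth values are invented for illustration. None of the names or numbers come from the paper.

```python
import math

def rotation_angle_deg(R):
    """Geodesic angle (degrees) of a 3x3 rotation matrix relative to the identity."""
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size empirical samples.

    For equal-size samples with uniform weights this is the mean absolute
    difference between the sorted values.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Hypothetical instrument depths in metres: original training set vs. augmented set.
depth_real = [0.08, 0.09, 0.10, 0.10, 0.11]
depth_augmented = [0.05, 0.08, 0.10, 0.13, 0.16]
depth_shift = wasserstein_1d(depth_real, depth_augmented)

# A 90-degree rotation about the z axis, used only to exercise the angle helper.
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
angle = rotation_angle_deg(R_z90)
```

The same distance can be applied per translation axis or to the rotation-angle samples. A larger value indicates that the augmented set covers poses further from the original distribution, although it says nothing by itself about appearance-level domain gap.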
Referee: [Abstract / Experiments] The abstract asserts outperformance of SurfSurg6D and benefit from SynSurg6D on three datasets but supplies no quantitative metrics, error breakdowns, or ablation studies. Without these in the main text, it is impossible to assess whether the reported gains are statistically meaningful or load-bearing for the method's contribution.

Authors: The abstract is intentionally concise and follows standard conventions by omitting detailed numbers. The experimental section does contain quantitative comparisons across the three datasets; however, we acknowledge that more granular error breakdowns, statistical significance reporting, and component ablations would improve clarity and allow readers to evaluate the contribution more rigorously. In the revised manuscript we will expand the experimental section with (i) per-axis rotation/translation error tables, (ii) success-rate curves at multiple thresholds, (iii) ablation tables isolating the geometry-consistency loss and dense-correspondence components, and (iv) statistical tests (e.g., paired t-tests) on the reported improvements. These results will be presented in the main text with clear references from the abstract. revision: yes
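Two of the analyses promised here, success-rate curves at multiple thresholds and a paired test on per-frame errors, can be sketched as follows. The error values are hypothetical and are not results from the paper. The helper names are illustrative, and only the t statistic is computed, since the p-value needs a t-distribution routine such as SciPy's `scipy.stats.ttest_rel`.

```python
import math
import statistics

def success_rate_curve(errors, thresholds):
    """Fraction of frames whose error is at or below each threshold."""
    n = len(errors)
    return [sum(e <= th for e in errors) / n for th in thresholds]

def paired_t_statistic(errors_a, errors_b):
    """t statistic for paired per-frame errors of two methods (n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical per-frame translation errors in millimetres on the same five frames.
baseline = [4.0, 6.0, 5.0, 7.0, 8.0]
proposed = [3.0, 4.0, 4.0, 4.0, 5.0]

curve = success_rate_curve(proposed, [3.0, 4.0, 5.0])
t_stat = paired_t_statistic(baseline, proposed)
```

Pairing matters because both methods are evaluated on the same frames: the test works on per-frame differences, so frame difficulty that affects both methods equally cancels out.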

Circularity Check

0 steps flagged

No circularity in derivation or claims

The paper introduces a synthetic dataset SynSurg6D and a dense correspondence method SurfSurg6D, with performance claims resting entirely on experimental comparisons against baselines on public external datasets (SurgRIPE, EndoVis2018, SurgPose). No equations, fitted parameters, or mathematical derivations appear that could reduce predictions to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The evaluation is independent of the method's internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; full methods section would be required to audit these.


[1] [1]

Objective assessment of intraoperative skills for robot-assisted radical prostatec- tomy (rarp): results from the erus scientific and educational working groups metrics initiative,

A. Mottrie, E. Mazzone, P. Wiklund, M. Graefen, J. W. Collins, R. De Groote, P. Dell’Oglio, S. Puliatti, and A. G. Gallagher, “Objective assessment of intraoperative skills for robot-assisted radical prostatec- tomy (rarp): results from the erus scientific and educational working groups metrics initiative,”BJU international, vol. 128, no. 1, pp. 103– 111, 2021

2021

[2] [2]

Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling,

Y . Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. B ´ejar, D. D. Yuhet al., “Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling,” inMICCAI workshop: M2cai, vol. 3, no. 2014, 2014, p. 3

2014

[3] [3]

Trans-svnet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer,

X. Gao, Y . Jin, Y . Long, Q. Dou, and P.-A. Heng, “Trans-svnet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer,” inMedical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24. Springer, 2021, p...

2021

[4] [4]

Deep learning in surgical workflow analysis: a review of phase and step recognition,

K. C. Demir, H. Schieber, T. Weise, D. Roth, M. May, A. Maier, and S. H. Yang, “Deep learning in surgical workflow analysis: a review of phase and step recognition,”IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 11, pp. 5405–5417, 2023

2023

[5] [5]

Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network,

Y . Jin, Q. Dou, H. Chen, L. Yu, J. Qin, C.-W. Fu, and P.-A. Heng, “Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network,”IEEE transactions on medical imaging, vol. 37, no. 5, pp. 1114–1126, 2017

2017

[6] [6]

Concurrent segmentation and localization for tracking of surgical instruments,

I. Laina, N. Rieke, C. Rupprecht, J. P. Vizca ´ıno, A. Eslami, F. Tombari, and N. Navab, “Concurrent segmentation and localization for tracking of surgical instruments,” inInternational conference on medical image computing and computer-assisted intervention. Springer, 2017, pp. 664–672

2017

[7] [7]

Differentiable rendering-based pose estimation for surgical robotic instruments,

Z. Liang, Z.-Y . Chiu, F. Richter, and M. C. Yip, “Differentiable rendering-based pose estimation for surgical robotic instruments,”arXiv preprint arXiv:2503.05953, 2025

work page arXiv 2025

[8]

A unified controller for region-reaching and deforming of soft objects,

Z. Wang, X. Li, D. Navarro-Alarcon, and Y.-h. Liu, “A unified controller for region-reaching and deforming of soft objects,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 472–478

2018

[9]

Caveats on the first-generation da vinci research kit: Latent technical constraints and essential calibrations [survey],

Z. Cui, J. Cartucho, S. Giannarou, and F. R. y Baena, “Caveats on the first-generation da vinci research kit: Latent technical constraints and essential calibrations [survey],” IEEE Robotics & Automation Magazine, vol. 32, no. 2, pp. 113–128, Jun. 2025. [Online]. Available: http://dx.doi.org/10.1109/MRA.2023.3310863

2025

[10]

An open-source research kit for the da vinci® surgical system,

P. Kazanzides, Z. Chen, A. Deguet, G. S. Fischer, R. H. Taylor, and S. P. DiMaio, “An open-source research kit for the da vinci® surgical system,” in 2014 IEEE international conference on robotics and automation (ICRA). IEEE, 2014, pp. 6434–6439

2014

[11]

Raven-ii: an open platform for surgical robotics research,

B. Hannaford, J. Rosen, D. W. Friedman, H. King, P. Roan, L. Cheng, D. Glozman, J. Ma, S. N. Kosari, and L. White, “Raven-ii: an open platform for surgical robotics research,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 4, pp. 954–959, 2012

2012

[12]

Surgripe challenge: Benchmark of surgical robot instrument pose estimation,

H. Xu, A. Weld, C. Xu, A. Roddan, J. Cartucho, M. A. Karaoglu, A. Ladikos, Y. Li, Y. Li, D. Shen, S. Yang, G. Lee, S. Park, J. Shin, Y.-G. Kim, L. Fothergill, D. Jones, P. Valdastri, D. Sarikaya, and S. Giannarou, “Surgripe challenge: Benchmark of surgical robot instrument pose estimation,” arXiv preprint arXiv:2501.02990, 2025

2025

[13]

A computationally efficient method for hand–eye calibration,

Z. Zhang, L. Zhang, and G.-Z. Yang, “A computationally efficient method for hand–eye calibration,” International journal of computer assisted radiology and surgery, vol. 12, no. 10, pp. 1775–1787, 2017

2017

[14]

Realistic data generation for 6d pose estimation of surgical instruments,

J. A. Barragan, J. Zhang, H. Zhou, A. Munawar, and P. Kazanzides, “Realistic data generation for 6d pose estimation of surgical instruments,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 13347–13353

2024

[15]

GDR-Net: Geometry-guided direct regression network for monocular 6d object pose estimation,

G. Wang, F. Manhardt, F. Tombari, and X. Ji, “GDR-Net: Geometry-guided direct regression network for monocular 6d object pose estimation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 16611–16621

2021

[16]

Mrc-net: 6-dof pose estimation with multiscale residual correlation,

Y. Li, Y. Mao, R. Bala, and S. Hadap, “Mrc-net: 6-dof pose estimation with multiscale residual correlation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10476–10486

2024

[17]

Megapose: 6d pose estimation of novel objects via render & compare,

Y. Labbé, L. Manuelli, A. Mousavian, S. Tyree, S. Birchfield, J. Tremblay, J. Carpentier, M. Aubry, D. Fox, and J. Sivic, “Megapose: 6d pose estimation of novel objects via render & compare,” in Proceedings of the 6th Conference on Robot Learning (CoRL), 2022

2022

[18]

Foundpose: Unseen object pose estimation with foundation features,

E. P. Örnek, Y. Labbé, B. Tekin, L. Ma, C. Keskin, C. Forster, and T. Hodaň, “Foundpose: Unseen object pose estimation with foundation features,” European Conference on Computer Vision (ECCV), 2024

2024

[19]

FoundationPose: Unified 6d pose estimation and tracking of novel objects,

B. Wen, W. Yang, J. Kautz, and S. Birchfield, “FoundationPose: Unified 6d pose estimation and tracking of novel objects,” in CVPR, 2024

2024

[20]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

2023

[21]

Focal length and object pose estimation via render and compare,

G. Ponimatkin, Y. Labbé, B. Russell, M. Aubry, and J. Sivic, “Focal length and object pose estimation via render and compare,” 2022. [Online]. Available: https://arxiv.org/abs/2204.05145

2022

[22]

Detection, segmentation, and 3d pose estimation of surgical tools using convolutional neural networks and algebraic geometry,

M. K. Hasan, L. Calvet, N. Rabbani, and A. Bartoli, “Detection, segmentation, and 3d pose estimation of surgical tools using convolutional neural networks and algebraic geometry,” Medical Image Analysis, vol. 70, p. 101994, 2021

2021

[23]

Pose estimation for robot manipulators via keypoint optimization and sim-to-real transfer,

J. Lu, F. Richter, and M. C. Yip, “Pose estimation for robot manipulators via keypoint optimization and sim-to-real transfer,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4622–4629, 2022

2022

[24]

A unified monocular camera-based and pattern-free hand-to-eye calibration algorithm for surgical robots with rcm constraints,

B. Lu, B. Li, Q. Dou, and Y. Liu, “A unified monocular camera-based and pattern-free hand-to-eye calibration algorithm for surgical robots with rcm constraints,” IEEE/ASME Transactions on Mechatronics, vol. 27, no. 6, pp. 5124–5135, 2022

2022

[25]

3-d pose estimation of articulated instruments in robotic minimally invasive surgery,

M. Allan, S. Ourselin, D. J. Hawkes, J. D. Kelly, and D. Stoyanov, “3-d pose estimation of articulated instruments in robotic minimally invasive surgery,” IEEE transactions on medical imaging, vol. 37, no. 5, pp. 1204–1213, 2018

2018

[26]

Instrument-splatting: Controllable photorealistic reconstruction of surgical instruments using gaussian splatting,

S. Yang, Z. Wu, M. Hong, Q. Li, D. Shen, S. E. Salcudean, and Y. Jin, “Instrument-splatting: Controllable photorealistic reconstruction of surgical instruments using gaussian splatting,” arXiv preprint arXiv:2503.04082, 2025

2025

[27]

Surfemb: Dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings,

R. L. Haugaard and A. G. Buch, “Surfemb: Dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 6739–6748

2022

[28]

Surgpose: a dataset for articulated robotic surgical tool pose estimation and tracking,

Z. Wu, A. Schmidt, R. Moore, H. Zhou, A. Banks, P. Kazanzides, and S. E. Salcudean, “Surgpose: a dataset for articulated robotic surgical tool pose estimation and tracking,” 2025. [Online]. Available: https://arxiv.org/abs/2502.11534

2025

[29]

2018 robotic scene segmentation challenge,

M. Allan, S. Kondo, S. Bodenstedt, S. Leger, R. Kadkhodamohammadi, I. Luengo, F. Fuentes, E. Flouty, A. Mohammed, M. Pedersen et al., “2018 robotic scene segmentation challenge,” arXiv preprint arXiv:2001.11190, 2020

2020

[30]

Bop: Benchmark for 6d object pose estimation,

T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. Glent Buch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis et al., “Bop: Benchmark for 6d object pose estimation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 19–34

2018

[31]

Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,

S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Asian conference on computer vision. Springer, 2012, pp. 548–562

2012

[32]

Endosurf: Neural surface reconstruction of deformable tissues with stereo endoscope videos,

R. Zha, X. Cheng, H. Li, M. Harandi, and Z. Ge, “Endosurf: Neural surface reconstruction of deformable tissues with stereo endoscope videos,” in International conference on medical image computing and computer-assisted intervention. Springer, 2023, pp. 13–23

2023

[33]

Neural rendering for stereo 3d reconstruction of deformable tissues in robotic surgery,

Y. Wang, Y. Long, S. H. Fan, and Q. Dou, “Neural rendering for stereo 3d reconstruction of deformable tissues in robotic surgery,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 431–441

2022

[34]

Blenderproc2: A procedural pipeline for photorealistic rendering,

M. Denninger, D. Winkelbauer, M. Sundermeyer, W. Boerdijk, M. Knauer, K. H. Strobl, M. Humt, and R. Triebel, “Blenderproc2: A procedural pipeline for photorealistic rendering,” Journal of Open Source Software, vol. 8, no. 82, p. 4901, 2023. [Online]. Available: https://doi.org/10.21105/joss.04901

2023

[35]

instrument cad,

jhu dvrk, “instrument cad,” github, n.d., online. [Online]. Available: https://github.com/jhu-dvrk/instrument-cad

[36]

GrabCAD — Model Library,

GrabCAD, “GrabCAD — Model Library,” GrabCAD, n.d., online. [Online]. Available: https://grabcad.com/library

[37]

Representation Learning with Contrastive Predictive Coding

A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018

2018

[38]

Multiple view geometry in computer vision,

R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003

2003

[39]

EPnP: An accurate O(n) solution to the PnP problem,

V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International journal of computer vision, vol. 81, no. 2, pp. 155–166, 2009

2009

[40]

Contrastive learning with hard negative samples,

J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” arXiv preprint arXiv:2010.04592, 2020

2020

[41]

Measures of the amount of ecologic association between species,

L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945

1945

[42]

Coco challenge: Keypoint evaluation,

COCO Consortium, “Coco challenge: Keypoint evaluation,” https://cocodataset.org/#keypoints-eval, 2014, accessed: 2026-01-08

2014

[43]

Implicit neural representations with periodic activation functions,

V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” 2020. [Online]. Available: https://arxiv.org/abs/2006.09661

2020

[44]

Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015. [Online]. Available: https://arxiv.org/abs/1512.03385

2015

[45]

Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [Online]. Available: https://arxiv.org/abs/1412.6980

[47]

Realistic surgical image dataset generation based on 3d gaussian splatting,

T. Zeng, G. Loza Galindo, J. Hu, P. Valdastri, and D. Jones, “Realistic surgical image dataset generation based on 3d gaussian splatting,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 510–519

2024

[48]

Real-time capable learning-based visual tool pose correction via differentiable simulation,

S. Yang and Z. Chua, “Real-time capable learning-based visual tool pose correction via differentiable simulation,” arXiv preprint arXiv:2505.08875, 2025

2025