Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models

Grayson Byrd; Han Zhang; Hao Ding; Hongchao Shu; Juan Antonio Barragan; Lalithkumar Seenivasan; Mathias Unberath; Peter Kazanzides; Pu Xiao; Russell H. Taylor

arxiv: 2409.13107 · v3 · submitted 2024-09-19 · 💻 cs.RO

Pith reviewed 2026-05-23 20:17 UTC · model grok-4.3

classification 💻 cs.RO

keywords digital twin · vision foundation models · LLM agents · surgical automation · embodied intelligence · peg transfer · gauze retrieval · dVRK platform

The pith

Digital twin representations from vision foundation models enable LLM agents to plan surgical tasks with greater flexibility than prior limited perception methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a digital twin perception approach built on vision foundation models can generate the detailed natural language scene descriptions required for LLM agents to plan complex surgical action sequences. Earlier work depended on simple perception solutions that sufficed only for tightly controlled bench-top tests and could not scale to less constrained environments. The authors integrate their digital twin representation with an LLM planner and the dVRK platform, then evaluate the resulting system on peg transfer and gauze retrieval tasks. A sympathetic reader would care because reliable scene understanding is the missing piece that could let embodied surgical systems handle real variability without custom engineering for each setting.

Core claim

The paper claims that capitalizing on the performance and out-of-the-box generalization of recent vision foundation models yields a digital twin representation that supplies sufficiently detailed natural language scene information for LLM-based task planning; when this representation is combined with an LLM agent and deployed on the dVRK platform, the embodied system achieves strong task performance and generalizability to varied environmental settings on peg transfer and gauze retrieval.

What carries the argument

The digital twin-based machine perception approach that uses vision foundation models to convert visual input into detailed natural language scene representations for LLM planning.
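
As a rough illustration of what "converting a digital twin into a natural language scene representation" could look like, here is a minimal sketch. The `TwinObject` class, the `describe_scene` function, the field names, and the example values are all hypothetical; the paper's actual data structures and prompt format are not shown in the material reviewed here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TwinObject:
    """One entity in a hypothetical digital twin (not the paper's schema)."""
    name: str
    position_mm: Tuple[float, float, float]  # x, y, z in an assumed camera frame
    state: str = ""  # free-form status, e.g. "on peg 3"


def describe_scene(objects: List[TwinObject]) -> str:
    """Render the twin as plain sentences that could be placed in an LLM prompt."""
    if not objects:
        return "The scene is empty."
    sentences = []
    # Sort by name so the same scene always yields the same description.
    for obj in sorted(objects, key=lambda o: o.name):
        x, y, z = obj.position_mm
        sentence = f"{obj.name} is at ({x:.1f}, {y:.1f}, {z:.1f}) mm"
        if obj.state:
            sentence += f" and is {obj.state}"
        sentences.append(sentence + ".")
    return " ".join(sentences)


# Illustrative values only; they are not taken from the paper.
scene = [
    TwinObject("block_1", (12.0, -4.5, 80.0), "on peg 3"),
    TwinObject("gripper_left", (0.0, 0.0, 60.0), "open"),
]
print(describe_scene(scene))
```

The deterministic ordering is a design choice for this sketch: it makes the text reproducible, which helps when comparing planner outputs across runs.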

If this is right

The system performs peg transfer and gauze retrieval tasks with strong results.
Performance holds across varied environmental settings.
The approach provides greater flexibility than prior simple perception solutions for scaling to less constrained conditions.
Integration of the digital twin representation with an LLM agent produces an embodied intelligence system on the dVRK platform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same digital twin construction could support planning for additional surgical tasks once the foundation models are shown to handle more complex scenes.
A fuller digital twin framework might improve the ability to interpret why the LLM selects particular action sequences.
Wider use of foundation-model digital twins could reduce the need for task-specific perception code in other robotic automation domains.

Load-bearing premise

Vision foundation models can supply natural language scene representations that are detailed and accurate enough for LLM agents to plan reliably in less constrained surgical environments.

What would settle it

A controlled test in which the digital twin representation produces incomplete or inaccurate scene descriptions, causing the LLM planner to generate incorrect action sequences during peg transfer or gauze retrieval under changed lighting, object arrangements, or instrument variations.
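
The logic of such a test can be sketched with a toy stand-in for the planner. Everything here is an assumption made for illustration: `toy_planner` is a deterministic placeholder rather than an LLM, scenes are reduced to a block-to-peg mapping, and the names are hypothetical. The point is only to show how an incomplete or inaccurate description propagates into a wrong action sequence.

```python
from typing import Dict, List


def toy_planner(described: Dict[str, str]) -> List[str]:
    """Placeholder for an LLM planner: one transfer step per described block."""
    return [f"transfer {name} from {peg}" for name, peg in sorted(described.items())]


def plan_errors(truth: Dict[str, str], described: Dict[str, str]) -> List[str]:
    """Compare the plan built from the description with the plan the true scene needs."""
    plan = set(toy_planner(described))
    expected = set(toy_planner(truth))
    missing = [f"missing: {step}" for step in sorted(expected - plan)]
    unsupported = [f"unsupported: {step}" for step in sorted(plan - expected)]
    return missing + unsupported


truth = {"block_1": "peg_1", "block_2": "peg_2"}
complete = dict(truth)                                  # perception got everything right
incomplete = {"block_1": "peg_1"}                       # perception dropped block_2
inaccurate = {"block_1": "peg_1", "block_2": "peg_5"}   # perception reported the wrong peg

print(plan_errors(truth, complete))
print(plan_errors(truth, incomplete))
print(plan_errors(truth, inaccurate))
```

In a real experiment the description errors would come from the perception module under changed lighting or object arrangements, and the plan would come from the actual LLM agent; the comparison step would stay the same in spirit.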

Figures

Figures reproduced from arXiv: 2409.13107 by Grayson Byrd, Han Zhang, Hao Ding, Hongchao Shu, Juan Antonio Barragan, Lalithkumar Seenivasan, Mathias Unberath, Peter Kazanzides, Pu Xiao, Russell H. Taylor.

**Figure 1.** Illustration of the digital twin-based embodied surgical system. A machine perception module is applied to extract digital twin-based scene …

**Figure 2.** Illustration of the workflow of the proposed embodied surgical system with digital twin-based machine perception. The captured image is first …

**Figure 3.** Illustration of physical setup and varied experimental environment.

Original abstract

Large language model-based (LLM) agents are emerging as a powerful enabler of robust embodied intelligence due to their capability of planning complex action sequences. Sound planning ability is necessary for robust automation in many task domains, but especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, developing similarly powerful and robust perception algorithms is necessary to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM-based task planning while adopting simple yet severely limited perception solutions to meet the needs for bench-top experiments, but lacks the critical flexibility to scale to less constrained settings. In this work, we propose an alternate perception approach -- a digital twin (DT)-based machine perception approach that capitalizes on the convincing performance and out-of-the-box generalization of recent vision foundation models. Integrating our DT representation and LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizability to varied environmental settings. Despite a convincing performance, this work is merely a first step towards the integration of DT representations. Future studies are necessary for the realization of a comprehensive DT framework to improve the interpretability and generalizability of embodied intelligence in surgery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper integrates vision foundation models into digital-twin scene representations that feed an LLM planner on the dVRK for peg transfer and gauze retrieval, but the abstract reports no metrics to support its performance claims.

The paper integrates vision foundation models to build digital-twin representations of the surgical scene and feeds those into an LLM agent for action planning on the dVRK. The authors run the system on peg transfer and gauze retrieval and state that it handles varied settings better than the limited perception setups used by earlier LLM surgical agents. That combination is the concrete step forward here; prior work had relied on hand-crafted or overly constrained scene inputs, so using off-the-shelf foundation models for richer natural-language descriptions is a direct response to that gap. The abstract is also clear that this is only an initial integration and that a fuller DT framework is still needed, which keeps the claims in proportion.

The main weakness is the absence of any numbers. The abstract asserts strong task performance and generalizability yet gives no success rates, trial counts, error bars, baselines, or dataset details. Without those, the central empirical claim cannot be checked. The methods and results sections are referenced but not visible in the supplied material, so the evaluation remains unverified.

The paper is aimed at groups working on embodied LLM agents in surgery or robotics who need examples of how foundation-model perception can be wired into planning loops. It is early-stage, but the integration is explicit enough that a serious referee could usefully press for the missing quantitative evidence and comparisons. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes using digital twin (DT) representations generated by vision foundation models to supply detailed natural-language scene descriptions for LLM-based agents in surgical task planning. It integrates this perception module with an LLM planner and the dVRK platform to create an embodied system, then evaluates robustness on peg-transfer and gauze-retrieval tasks, claiming strong performance and generalizability to varied settings while describing the work as an initial step toward a fuller DT framework.

Significance. If the empirical claims are substantiated, the work would demonstrate a practical route to more flexible perception for LLM agents in robotic surgery, moving beyond the simple, constrained perception pipelines used in prior bench-top studies. The explicit framing as a first integration of foundation-model DTs with dVRK planning is a modest but concrete contribution to embodied surgical intelligence.

major comments (2)

[Abstract] Abstract (and, by the reader's report, the results section): the central claim of 'strong task performance and generalizability' is asserted without any quantitative metrics, success rates, error bars, dataset sizes, or exclusion criteria. This absence prevents verification of the robustness and generalizability assertions that are load-bearing for the paper's contribution.
[Abstract] The weakest assumption identified by the reader—that vision foundation models already supply sufficiently detailed and accurate natural-language scene representations for reliable LLM planning in less-constrained surgical scenes—is stated but not tested with any ablation or failure-case analysis in the supplied abstract.

minor comments (1)

[Abstract] The sentence containing 'developing similarly powerful and robust perception algorithms is necessary' is grammatical as written (the gerund subject 'developing' takes the singular verb 'is'), but it is long and would read more clearly if split in two.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical substantiation in our abstract and results. We address the major comments point-by-point below and will revise the manuscript to improve clarity and verifiability of claims.

Referee: [Abstract] Abstract (and, by the reader's report, the results section): the central claim of 'strong task performance and generalizability' is asserted without any quantitative metrics, success rates, error bars, dataset sizes, or exclusion criteria. This absence prevents verification of the robustness and generalizability assertions that are load-bearing for the paper's contribution.

Authors: We agree that the abstract (and potentially the results presentation) would be strengthened by including explicit quantitative metrics. The manuscript describes evaluations on peg transfer and gauze retrieval using the dVRK, but we will revise the abstract to report key figures such as success rates across trials, number of environmental variations tested, and any statistical details. We will also ensure the results section explicitly lists trial counts, exclusion criteria if any, and error analysis to allow verification of the generalizability claims. revision: yes

Referee: [Abstract] The weakest assumption identified by the reader—that vision foundation models already supply sufficiently detailed and accurate natural-language scene representations for reliable LLM planning in less-constrained surgical scenes—is stated but not tested with any ablation or failure-case analysis in the supplied abstract.

Authors: The abstract summarizes the DT perception approach without detailing ablations. The full evaluation on the two tasks with the integrated LLM planner provides implicit evidence of the perception module's utility in enabling planning. We will revise the abstract to qualify the assumption more explicitly and add a brief failure-case discussion or reference to robustness testing in the results section. A dedicated ablation study on the vision foundation model component alone is not present and would require additional experiments beyond the current scope. revision: partial
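
For readers wondering what "success rates across trials" with an uncertainty estimate might look like in practice, the following is a minimal sketch using the Wilson score interval for a binomial proportion. The function name is an assumption, and the trial counts are placeholders rather than results from the paper, which are not available in the material reviewed here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts for illustration only; not taken from the paper.
low, high = wilson_interval(successes=18, trials=20)
print(f"success rate 18/20, 95% interval: [{low:.3f}, {high:.3f}]")
```

The Wilson interval is chosen here because it behaves sensibly for the small trial counts typical of bench-top robot experiments, where the simpler normal approximation can extend past 0 or 1.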

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical system integration of vision-foundation-model-derived digital twin scene representations with an LLM planner, evaluated on peg-transfer and gauze-retrieval tasks on the dVRK. No equations, parameter-fitting steps, uniqueness theorems, or self-citational load-bearing premises appear in the derivation chain; the central claim is that the integrated system exhibits task performance and generalizability in the reported experiments. This is a self-contained empirical demonstration rather than a reduction of any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; ledger is limited to the central modeling assumption stated in the text.

axioms (1)

domain assumption: Vision foundation models deliver out-of-the-box generalization sufficient to produce detailed natural-language scene representations usable by LLM planners in surgical settings.
The abstract explicitly states that the approach capitalizes on this property to overcome limitations of prior perception solutions.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research
cs.CV 2025-11 unverdicted novelty 5.0

TwinOR creates dynamic photorealistic digital twins of operating rooms that generate realistic RGB and depth data enabling embodied AI perception and localization tasks to match real-world performance levels.

SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge
cs.CV 2024-07 accept novelty 5.0

SegSTRONG-C provides a new benchmark where top models reach 0.9394 DSC and 0.9301 NSD on corrupted surgical tool segmentation tests, showing conventional techniques help but calling for more innovative robustness methods.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 2 Pith papers · 7 internal anchors

[1]

Robotic surgery: Review on minimally invasive techniques,

B. Johansson, E. Eriksson, N. Berglund, and I. Lindgren, “Robotic surgery: Review on minimally invasive techniques,” Fusion of Mul- tidisciplinary Research, An International Journal , vol. 2, no. 2, pp. 201–210, 2021

work page 2021
[2]

A review on how da vinci surgical system is changing the health care,

N. Nath, “A review on how da vinci surgical system is changing the health care,” in The 2nd Advanced Manufacturing Student Conference (AMSC22) Chemnitz, Germany 07–08 July 2022 , vol. 7, 2022, p. 193

work page 2022
[3]

An open-source research kit for the da Vinci® Surgical System,

P. Kazanzides, Z. Chen, A. Deguet, G. S. Fischer, R. H. Taylor, and S. P. DiMaio, “An open-source research kit for the da Vinci® Surgical System,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 6434–6439

work page 2014
[4]

SuFIA: Language-guided augmented dex- terity for robotic surgical assistants,

M. Moghani, L. Doorenbos, W. C.-H. Panitch, S. Huver, M. Azizian, K. Goldberg, and A. Garg, “SuFIA: Language-guided augmented dex- terity for robotic surgical assistants,”arXiv preprint arXiv:2405.05226, 2024

work page arXiv 2024
[5]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. , “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choro- manski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., “Rt-2: Vision- language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

W.et al.Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks (2024)

J. W. Kim, T. Z. Zhao, S. Schmidgall, A. Deguet, M. Kobilarov, C. Finn, and A. Krieger, “Surgical robot transformer (srt): Imitation learning for surgical tasks,” arXiv preprint arXiv:2407.12998 , 2024

work page arXiv 2024
[8]

Multi-objective cross- task learning via goal-conditioned GPT-based decision transformers for surgical robot task automation,

J. Fu, Y . Long, K. Chen, W. Wei, and Q. Dou, “Multi-objective cross- task learning via goal-conditioned GPT-based decision transformers for surgical robot task automation,” arXiv preprint arXiv:2405.18757, 2024

work page arXiv 2024
[9]

Efficiently calibrating cable-driven surgical robots with RGBD fiducial sensing and recurrent neural networks,

M. Hwang, B. Thananjeyan, S. Paradis, D. Seita, J. Ichnowski, D. Fer, T. Low, and K. Goldberg, “Efficiently calibrating cable-driven surgical robots with RGBD fiducial sensing and recurrent neural networks,” IEEE Robotics and Automation Letters , vol. 5, no. 4, pp. 5937–5944, 2020

work page 2020
[10]

Automating surgical peg transfer: Calibra- tion with deep learning can exceed speed, accuracy, and consistency of humans,

M. Hwang, J. Ichnowski, B. Thananjeyan, D. Seita, S. Paradis, D. Fer, T. Low, and K. Goldberg, “Automating surgical peg transfer: Calibra- tion with deep learning can exceed speed, accuracy, and consistency of humans,” IEEE Transactions on Automation Science and Engineering, vol. 20, no. 2, pp. 909–922, 2022

work page 2022
[11]

STITCH: Augmented dex- terity for suture throws including thread coordination and handoffs,

K. Hari, H. Kim, W. Panitch, K. Srinivas, V . Schorp, K. Dharmarajan, S. Ganti, T. Sadjadpour, and K. Goldberg, “STITCH: Augmented dex- terity for suture throws including thread coordination and handoffs,” arXiv preprint arXiv:2404.05151 , 2024

work page arXiv 2024
[12]

Autonomous robotic laparoscopic surgery for intestinal anastomosis,

H. Saeidi, J. D. Opfermann, M. Kam, S. Wei, S. L ´eonard, M. H. Hsieh, J. U. Kang, and A. Krieger, “Autonomous robotic laparoscopic surgery for intestinal anastomosis,” Science Robotics , vol. 7, no. 62, p. eabj2908, 2022

work page 2022
[13]

Autonomous system for vaginal cuff closure via model-based planning and markerless tracking tech- niques,

M. Kam, S. Wei, J. D. Opfermann, H. Saeidi, M. H. Hsieh, K. C. Wang, J. U. Kang, and A. Krieger, “Autonomous system for vaginal cuff closure via model-based planning and markerless tracking tech- niques,” IEEE Robotics and Automation Letters , vol. 8, no. 7, pp. 3916–3923, 2023

work page 2023
[14]

Automating vascular shunt insertion with the dvrk surgical robot,

K. Dharmarajan, W. Panitch, M. Jiang, K. Srinivas, B. Shi, Y . Avigal, H. Huang, T. Low, D. Fer, and K. Goldberg, “Automating vascular shunt insertion with the dvrk surgical robot,” in 2023 IEEE Interna- tional Conference on Robotics and Automation (ICRA) . IEEE, 2023, pp. 6781–6788

work page 2023
[15]

Learning to localize, grasp, and hand over unmodified surgical needles,

A. Wilcox, J. Kerr, B. Thananjeyan, J. Ichnowski, M. Hwang, S. Par- adis, D. Fer, and K. Goldberg, “Learning to localize, grasp, and hand over unmodified surgical needles,” in 2022 International Conference on Robotics and Automation (ICRA) . IEEE, 2022, pp. 9637–9643

work page 2022
[16]

Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding,

H. Ding, L. Seenivasan, B. D. Killeen, S. M. Cho, and M. Unberath, “Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding,” ais, vol. 4, no. 3, pp. 109–138, 2024

work page 2024
[17]

Twin-S: a digital twin for skull base surgery,

H. Shu, R. Liang, Z. Li, A. Goodridge, X. Zhang, H. Ding, N. Nagu- ruru, M. Sahu, F. X. Creighton, R. H. Taylor, et al., “Twin-S: a digital twin for skull base surgery,”International journal of computer assisted radiology and surgery , vol. 18, no. 6, pp. 1077–1084, 2023

work page 2023
[18]

Creating a digital twin of spinal surgery: A proof of concept,

J. Hein, F. Giraud, L. Calvet, A. Schwarz, N. A. Cavalcanti, S. Prokudin, M. Farshad, S. Tang, M. Pollefeys, F. Carrillo, et al. , “Creating a digital twin of spinal surgery: A proof of concept,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2355–2364

work page 2024
[19]

Stand in surgeon’s shoes: virtual reality cross-training to enhance teamwork in surgery,

B. D. Killeen, H. Zhang, L. J. Wang, Z. Liu, C. Kleinbeck, M. Rosen, R. H. Taylor, G. Osgood, and M. Unberath, “Stand in surgeon’s shoes: virtual reality cross-training to enhance teamwork in surgery,” International Journal of Computer Assisted Radiology and Surgery , pp. 1–10, 2024

work page 2024
[20]

Neural digital twins: reconstructing complex medical environments for spatial planning in virtual reality,

C. Kleinbeck, H. Zhang, B. D. Killeen, D. Roth, and M. Unberath, “Neural digital twins: reconstructing complex medical environments for spatial planning in virtual reality,” Int. J. CARS, vol. 19, no. 7, pp. 1301–1312, July 2024

work page 2024
[21]

Mask r-cnn,

K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision , 2017, pp. 2961–2969

work page 2017
[22]

Hybrid task cascade for instance segmenta- tion,

K. Chen, J. Pang, J. Wang, Y . Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al., “Hybrid task cascade for instance segmenta- tion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2019, pp. 4974–4983

work page 2019
[23]

Deeply shape-guided cascade for instance segmentation,

H. Ding, S. Qiao, A. Yuille, and W. Shen, “Deeply shape-guided cascade for instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 8278–8288

work page 2021
[24]

Pointrend: Image seg- mentation as rendering,

A. Kirillov, Y . Wu, K. He, and R. Girshick, “Pointrend: Image seg- mentation as rendering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2020, pp. 9799–9808

work page 2020
[25]

Masked-attention mask transformer for universal image segmenta- tion,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmenta- tion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2022, pp. 1290–1299

work page 2022
[26]

Pvnet: Pixel- wise voting network for 6dof pose estimation,

S. Peng, Y . Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel- wise voting network for 6dof pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2019, pp. 4561–4570

work page 2019
[27]

6d pose estimation of objects: Recent technologies and challenges,

Z. He, W. Feng, X. Zhao, and Y . Lv, “6d pose estimation of objects: Recent technologies and challenges,” Applied Sciences, vol. 11, no. 1, p. 228, 2020

work page 2020
[28]

6d object position estimation from 2d images: A literature review,

G. Marullo, L. Tanzi, P. Piazzolla, and E. Vezzetti, “6d object position estimation from 2d images: A literature review,”Multimedia Tools and Applications, vol. 82, no. 16, pp. 24 605–24 643, 2023

work page 2023
[29]

Towards markerless surgical tool and hand pose estimation,

J. Hein, M. Seibold, F. Bogo, M. Farshad, M. Pollefeys, P. F ¨urnstahl, and N. Navab, “Towards markerless surgical tool and hand pose estimation,” International journal of computer assisted radiology and surgery, vol. 16, pp. 799–808, 2021

work page 2021
[30]

Tatoo: vision-based joint tracking of anatomy and tool for skull-base surgery,

Z. Li, H. Shu, R. Liang, A. Goodridge, M. Sahu, F. X. Creighton, R. H. Taylor, and M. Unberath, “Tatoo: vision-based joint tracking of anatomy and tool for skull-base surgery,” International Journal of Computer Assisted Radiology and Surgery , vol. 18, no. 7, pp. 1303– 1310, 2023

work page 2023
[31]

OneSLAM to map them all: a generalized approach to SLAM for monocular endoscopic imaging based on tracking any point,

T. Teufel, H. Shu, R. D. Soberanis-Mukul, J. E. Mangulabnan, M. Sahu, S. S. Vedula, M. Ishii, G. Hager, R. H. Taylor, and M. Unberath, “OneSLAM to map them all: a generalized approach to SLAM for monocular endoscopic imaging based on tracking any point,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–8, 2024

work page 2024
[32]

CaRTS: Causality-driven robot tool segmentation from vision and kinematics data,

H. Ding, J. Zhang, P. Kazanzides, J. Y . Wu, and M. Unberath, “CaRTS: Causality-driven robot tool segmentation from vision and kinematics data,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 387–398

work page 2022
[33]

Rethinking causality- driven robot tool segmentation with temporal constraints,

H. Ding, J. Y . Wu, Z. Li, and M. Unberath, “Rethinking causality- driven robot tool segmentation with temporal constraints,” Interna- tional Journal of Computer Assisted Radiology and Surgery , pp. 1009 – 1016, 2022

work page 2022
[34]

SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge

H. Ding, T. Lu, Y . Zhang, R. Liang, H. Shu, L. Seenivasan, Y . Long, Q. Dou, C. Gao, and M. Unberath, “SegSTRONG-C: Segmenting surgical tools robustly on non-adversarial generated corruptions – an endovis’24 challenge,” 2024. [Online]. Available: https://arxiv.org/abs/2407.11906

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. , “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 4015–4026

work page 2023
[36]

SAM 2: Segment Anything in Images and Videos

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafson, et al., “Sam 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Foundationpose: Unified 6d pose estimation and tracking of novel objects,

B. Wen, W. Yang, J. Kautz, and S. Birchfield, “Foundationpose: Unified 6d pose estimation and tracking of novel objects,” in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 868–17 879

work page 2024
[38]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 371–10 381

work page 2024
[39]

Enabling confidentiality in content- based publish/subscribe infrastructures,

C. Raiciu and D. S. Rosenblum, “Enabling confidentiality in content- based publish/subscribe infrastructures,” in 2006 securecomm and workshops. IEEE, 2006, pp. 1–11

work page 2006
[40]

Spatialtracker: Tracking any 2d pixels in 3d space,

Y . Xiao, Q. Wang, S. Zhang, N. Xue, S. Peng, Y . Shen, and X. Zhou, “Spatialtracker: Tracking any 2d pixels in 3d space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2024, pp. 20 406–20 417

work page 2024
[41]

arXiv preprint arXiv:2408.04098 (2024)

Y . Shen, H. Ding, X. Shao, and M. Unberath, “Performance and non- adversarial robustness of the segment anything model 2 in surgical video segmentation,” arXiv preprint arXiv:2408.04098 , 2024

work page arXiv 2024
[42]

From generalization to precision: exploring SAM for tool segmentation in surgical environments,

K. J. Oguine, R. D. S. Mukul, N. Drenkow, and M. Unberath, “From generalization to precision: exploring SAM for tool segmentation in surgical environments,” in Medical Imaging 2024: Image Processing , vol. 12926. SPIE, 2024, pp. 7–12

work page 2024
[43]

Roboclip: One demonstration is enough to learn robot policies,

S. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Bıyık, D. Sadigh, C. Finn, and L. Itti, “Roboclip: One demonstration is enough to learn robot policies,” Advances in Neural Information Processing Systems , vol. 36, 2024

work page 2024
[44]

Bagging by learning to singulate layers using interactive perception,

L. Y . Chen, B. Shi, R. Lin, D. Seita, A. Ahmad, R. Cheng, T. Kollar, D. Held, and K. Goldberg, “Bagging by learning to singulate layers using interactive perception,” in 2023 IEEE/RSJ International Con- ference on Intelligent Robots and Systems (IROS) . IEEE, 2023, pp. 3176–3183

work page 2023
[45]

Super: A surgical perception framework for endoscopic tissue manipulation with surgical robotics,

Y . Li, F. Richter, J. Lu, E. K. Funk, R. K. Orosco, J. Zhu, and M. C. Yip, “Super: A surgical perception framework for endoscopic tissue manipulation with surgical robotics,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2294–2301, 2020

work page 2020
[46]

An automatic extraction method on medical feature points based on PointNet++ for robot-assisted knee arthroplasty,

W. Wang, H. Zhou, Y . Yan, X. Cheng, P. Yang, L. Gan, and S. Kuang, “An automatic extraction method on medical feature points based on PointNet++ for robot-assisted knee arthroplasty,” The International Journal of Medical Robotics and Computer Assisted Surgery , vol. 19, no. 1, p. e2464, 2023

work page 2023
[47]

Self-supervised learning for interactive perception of surgical thread for autonomous suture tail-shortening,

V . Schorp, W. Panitch, K. Shivakumar, V . Viswanath, J. Kerr, Y . Avi- gal, D. M. Fer, L. Ott, and K. Goldberg, “Self-supervised learning for interactive perception of surgical thread for autonomous suture tail-shortening,” in 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE) . IEEE, 2023, pp. 1–6

work page 2023
[48]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M ¨uller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF international conference on computer vision , 2021, pp. 12 179–12 188

work page 2021
[50]

C. Doersch, Y. Yang, M. Vecerik, D. Gokay, A. Gupta, Y. Aytar, J. Carreira, and A. Zisserman, “Tapir: Tracking any point with per-frame initialization and temporal refinement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 061–10 072
[51]

P. Liu, Y. Orru, J. Vakil, C. Paxton, N. M. M. Shafiullah, and L. Pinto, “OK-Robot: What really matters in integrating open-knowledge models for robotics,” 2024. [Online]. Available: https://arxiv.org/abs/2401.12202
[52]

M. Colledanchise and P. Ögren, “Behavior trees in robotics and AI,” July 2018. [Online]. Available: http://dx.doi.org/10.1201/9780429489105
[53]

C. Aeronautiques, A. Howe, C. Knoblock, I. D. McDermott, A. Ram, M. Veloso, D. Weld, D. W. Sri, A. Barrett, D. Christianson, et al., “PDDL: the planning domain definition language,” Technical Report, Tech. Rep., 1998
[54]

H. Kautz and B. Selman, “Pushing the envelope: planning, propositional logic, and stochastic search,” in Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, ser. AAAI’96. AAAI Press, 1996, pp. 1194–1201
[55]

L. Kocsis and C. Szepesvári, “Bandit based monte-carlo planning,” in Proceedings of the 17th European Conference on Machine Learning, ser. ECML’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 282–293. [Online]. Available: https://doi.org/10.1007/11871842_29
[56]

Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” 2023. [Online]. Available: https://arxiv.org/abs/2305.14078
[57]

Z. Wu, Z. Wang, X. Xu, J. Lu, and H. Yan, “Embodied task planning with large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2307.01848
[58]

K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “SayPlan: Grounding large language models using 3d scene graphs for scalable robot task planning,” 2023. [Online]. Available: https://arxiv.org/abs/2307.06135
[59]

Q. Gu, A. Kuwajerwala, S. Morin, K. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, C. Gan, C. de Melo, J. Tenenbaum, A. Torralba, F. Shkurti, and L. Paull, “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv, 2023
[60]

G. Cheng, C. Zhang, W. Cai, L. Zhao, C. Sun, and J. Bian, “Empowering large language models on robotic manipulation with affordance prompting,” 2024. [Online]. Available: https://arxiv.org/abs/2404.11027
[61]

Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” 2023. [Online]. Available: https://arxiv.org/abs/2307.16789
[62]

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” 2023. [Online]. Available: https://arxiv.org/abs/2302.04761
[63]

B. D. Killeen, S. Chaudhary, G. Osgood, and M. Unberath, “Take a shot! natural language control of intelligent robotic x-ray systems in surgery,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–9, 2024
[64]

Z. Chen, A. Deguet, R. H. Taylor, and P. Kazanzides, “Software architecture of the da Vinci Research Kit,” in IEEE International Conference on Robotic Computing (IRC), 2017, pp. 180–187
[65]

“dvrk camera registration,” Sept. 2024, [Online; accessed 14. Sep. 2024]. [Online]. Available: https://github.com/jhu-dvrk/dvrk_camera_registration/tree/main
[66]

R. Varghese and M. Sambath, “YOLOv8: A novel object detection algorithm with enhanced performance and robustness,” in 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS). IEEE, 2024, pp. 1–6
