arxiv: 2409.13107 · v3 · submitted 2024-09-19 · 💻 cs.RO

Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models

Pith reviewed 2026-05-23 20:17 UTC · model grok-4.3

classification 💻 cs.RO
keywords digital twin · vision foundation models · LLM agents · surgical automation · embodied intelligence · peg transfer · gauze retrieval · dVRK platform

The pith

Digital twin representations from vision foundation models enable LLM agents to plan surgical tasks with greater flexibility than prior limited perception methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a digital twin perception approach built on vision foundation models can generate the detailed natural language scene descriptions required for LLM agents to plan complex surgical action sequences. Earlier work depended on simple perception solutions that sufficed only for tightly controlled bench-top tests and could not scale to less constrained environments. The authors integrate their digital twin representation with an LLM planner and the dVRK platform, then evaluate the resulting system on peg transfer and gauze retrieval tasks. A sympathetic reader would care because reliable scene understanding is the missing piece that could let embodied surgical systems handle real variability without custom engineering for each setting.

Core claim

The paper claims that capitalizing on the performance and out-of-the-box generalization of recent vision foundation models yields a digital twin representation that supplies sufficiently detailed natural language scene information for LLM-based task planning; when this representation is combined with an LLM agent and deployed on the dVRK platform, the embodied system achieves strong task performance and generalizability to varied environmental settings on peg transfer and gauze retrieval.

What carries the argument

The digital twin-based machine perception approach that uses vision foundation models to convert visual input into detailed natural language scene representations for LLM planning.
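As a rough illustration of the final step in such a pipeline, the sketch below turns an already-estimated scene state into plain text suitable for an LLM prompt. It assumes the perception stage has produced object labels and 3D positions. The `SceneObject` type, the `describe_scene` function, the field names, and the sample coordinates are hypothetical and not taken from the paper, whose actual representation may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical digital twin entry: a label plus a 3D position in metres."""
    name: str
    position: Tuple[float, float, float]


def describe_scene(objects: List[SceneObject]) -> str:
    """Render the twin state as plain text that can be placed in an LLM prompt."""
    if not objects:
        return "The scene is empty."
    lines = [f"The scene contains {len(objects)} object(s):"]
    # Sort by name so the same scene always yields the same description.
    for obj in sorted(objects, key=lambda o: o.name):
        x, y, z = obj.position
        lines.append(f"- {obj.name} at (x={x:.3f}, y={y:.3f}, z={z:.3f}) m")
    return "\n".join(lines)


# Illustrative values only, not measurements from the paper.
scene = [
    SceneObject("peg_1", (0.012, -0.034, 0.101)),
    SceneObject("block_red", (0.020, 0.015, 0.098)),
]
print(describe_scene(scene))
```

Sorting the objects keeps the description deterministic, which makes it easier to compare planner behaviour across repeated runs of the same scene.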

If this is right

  • The system performs peg transfer and gauze retrieval tasks with strong results.
  • Performance holds across varied environmental settings.
  • The approach provides greater flexibility than prior simple perception solutions for scaling to less constrained conditions.
  • Integration of the digital twin representation with an LLM agent produces an embodied intelligence system on the dVRK platform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same digital twin construction could support planning for additional surgical tasks once the foundation models are shown to handle more complex scenes.
  • A fuller digital twin framework might improve the ability to interpret why the LLM selects particular action sequences.
  • Wider use of foundation-model digital twins could reduce the need for task-specific perception code in other robotic automation domains.

Load-bearing premise

Vision foundation models can supply natural language scene representations that are detailed and accurate enough for LLM agents to plan reliably in less constrained surgical environments.

What would settle it

A controlled test in which the digital twin representation produces incomplete or inaccurate scene descriptions, causing the LLM planner to generate incorrect action sequences during peg transfer or gauze retrieval under changed lighting, object arrangements, or instrument variations.

Figures

Figures reproduced from arXiv: 2409.13107 by Grayson Byrd, Han Zhang, Hao Ding, Hongchao Shu, Juan Antonio Barragan, Lalithkumar Seenivasan, Mathias Unberath, Peter Kazanzides, Pu Xiao, Russell H. Taylor.

Figure 1: Illustration of the digital twin-based embodied surgical system. A machine perception module is applied to extract digital twin-based scene ...
Figure 2: Illustration of the workflow of the proposed embodied surgical system with digital twin-based machine perception. The captured image is first ...
Figure 3: Illustration of physical setup and varied experimental environment.
Original abstract

Large language model-based (LLM) agents are emerging as a powerful enabler of robust embodied intelligence due to their capability of planning complex action sequences. Sound planning ability is necessary for robust automation in many task domains, but especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, developing similarly powerful and robust perception algorithms is necessary to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM-based task planning while adopting simple yet severely limited perception solutions to meet the needs for bench-top experiments, but lacks the critical flexibility to scale to less constrained settings. In this work, we propose an alternate perception approach -- a digital twin (DT)-based machine perception approach that capitalizes on the convincing performance and out-of-the-box generalization of recent vision foundation models. Integrating our DT representation and LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizability to varied environmental settings. Despite a convincing performance, this work is merely a first step towards the integration of DT representations. Future studies are necessary for the realization of a comprehensive DT framework to improve the interpretability and generalizability of embodied intelligence in surgery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes using digital twin (DT) representations generated by vision foundation models to supply detailed natural-language scene descriptions for LLM-based agents in surgical task planning. It integrates this perception module with an LLM planner and the dVRK platform to create an embodied system, then evaluates robustness on peg-transfer and gauze-retrieval tasks, claiming strong performance and generalizability to varied settings while describing the work as an initial step toward a fuller DT framework.

Significance. If the empirical claims are substantiated, the work would demonstrate a practical route to more flexible perception for LLM agents in robotic surgery, moving beyond the simple, constrained perception pipelines used in prior bench-top studies. The explicit framing as a first integration of foundation-model DTs with dVRK planning is a modest but concrete contribution to embodied surgical intelligence.

major comments (2)
  1. [Abstract] Abstract (and, by the reader's report, the results section): the central claim of 'strong task performance and generalizability' is asserted without any quantitative metrics, success rates, error bars, dataset sizes, or exclusion criteria. This absence prevents verification of the robustness and generalizability assertions that are load-bearing for the paper's contribution.
  2. [Abstract] The weakest assumption identified by the reader—that vision foundation models already supply sufficiently detailed and accurate natural-language scene representations for reliable LLM planning in less-constrained surgical scenes—is stated but not tested with any ablation or failure-case analysis in the supplied abstract.
minor comments (1)
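  1. [Abstract] The abstract contains a minor wording issue: in 'developing similarly powerful and robust perception algorithms is necessary', the singular verb is correct because the gerund 'developing' is the subject, but the plural noun directly before 'is' makes the sentence easy to misread. Recasting it as 'it is necessary to develop similarly powerful and robust perception algorithms' would avoid this.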
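
One way to report the success rates and error bars requested in the first major comment is a Wilson score interval on the per-task success proportion, which behaves reasonably for the small trial counts typical of bench-top robot experiments. The sketch below is a generic statistical helper, not the authors' evaluation code. The function name `wilson_interval` and the counts (18 successes in 20 trials) are hypothetical and chosen only for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts for illustration; these are not the paper's results.
successes, trials = 18, 20
low, high = wilson_interval(successes, trials)
print(f"success rate {successes / trials:.2f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Reporting an interval of this kind per task and per environmental variation, alongside the trial counts, would let readers judge how much of the claimed generalizability is supported by the data.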

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical substantiation in our abstract and results. We address the major comments point-by-point below and will revise the manuscript to improve clarity and verifiability of claims.

  1. Referee: [Abstract] Abstract (and, by the reader's report, the results section): the central claim of 'strong task performance and generalizability' is asserted without any quantitative metrics, success rates, error bars, dataset sizes, or exclusion criteria. This absence prevents verification of the robustness and generalizability assertions that are load-bearing for the paper's contribution.

    Authors: We agree that the abstract (and potentially the results presentation) would be strengthened by including explicit quantitative metrics. The manuscript describes evaluations on peg transfer and gauze retrieval using the dVRK, but we will revise the abstract to report key figures such as success rates across trials, number of environmental variations tested, and any statistical details. We will also ensure the results section explicitly lists trial counts, exclusion criteria if any, and error analysis to allow verification of the generalizability claims.

    revision: yes

  2. Referee: [Abstract] The weakest assumption identified by the reader—that vision foundation models already supply sufficiently detailed and accurate natural-language scene representations for reliable LLM planning in less-constrained surgical scenes—is stated but not tested with any ablation or failure-case analysis in the supplied abstract.

    Authors: The abstract summarizes the DT perception approach without detailing ablations. The full evaluation on the two tasks with the integrated LLM planner provides implicit evidence of the perception module's utility in enabling planning. We will revise the abstract to qualify the assumption more explicitly and add a brief failure-case discussion or reference to robustness testing in the results section. A dedicated ablation study on the vision foundation model component alone is not present and would require additional experiments beyond the current scope.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical system integration of vision-foundation-model-derived digital twin scene representations with an LLM planner, evaluated on peg-transfer and gauze-retrieval tasks on the dVRK. No equations, parameter-fitting steps, uniqueness theorems, or self-citational load-bearing premises appear in the derivation chain; the central claim is that the integrated system exhibits task performance and generalizability in the reported experiments. This is a self-contained empirical demonstration rather than a reduction of any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; ledger is limited to the central modeling assumption stated in the text.

axioms (1)
  • domain assumption: Vision foundation models deliver out-of-the-box generalization sufficient to produce detailed natural-language scene representations usable by LLM planners in surgical settings.
    The abstract explicitly states that the approach capitalizes on this property to overcome limitations of prior perception solutions.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research

    cs.CV · 2025-11 · unverdicted · novelty 5.0

    TwinOR creates dynamic photorealistic digital twins of operating rooms that generate realistic RGB and depth data enabling embodied AI perception and localization tasks to match real-world performance levels.

  2. SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge

    cs.CV · 2024-07 · accept · novelty 5.0

    SegSTRONG-C provides a new benchmark where top models reach 0.9394 DSC and 0.9301 NSD on corrupted surgical tool segmentation tests, showing conventional techniques help but calling for more innovative robustness methods.
