Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models
Pith reviewed 2026-05-23 20:17 UTC · model grok-4.3
The pith
Digital twin representations from vision foundation models enable LLM agents to plan surgical tasks with greater flexibility than prior limited perception methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the performance and out-of-the-box generalization of recent vision foundation models yield a digital twin representation detailed enough to supply natural language scene information for LLM-based task planning. Combined with an LLM agent and deployed on the dVRK platform, the resulting embodied system achieves strong task performance on peg transfer and gauze retrieval and generalizes to varied environmental settings.
What carries the argument
The digital twin-based machine perception approach that uses vision foundation models to convert visual input into detailed natural language scene representations for LLM planning.
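As a rough illustration of the idea above, the sketch below renders a structured scene state as plain sentences that could be placed in a planner prompt. It is not the paper's implementation: `TwinObject`, `describe_scene`, the field names, the coordinate values, and the millimetre camera-frame convention are all hypothetical, chosen only to show the shape of such a conversion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TwinObject:
    """One entity in a hypothetical digital twin scene state."""
    name: str
    position_mm: Tuple[float, float, float]  # assumed camera-frame coordinates
    state: str  # free-text status, e.g. "on peg 3"


def describe_scene(objects: List[TwinObject]) -> str:
    """Render the twin state as plain sentences an LLM planner could read."""
    if not objects:
        return "The scene contains no tracked objects."
    lines = [f"The scene contains {len(objects)} tracked object(s)."]
    # Sort by name so the same scene always yields the same description.
    for obj in sorted(objects, key=lambda o: o.name):
        x, y, z = obj.position_mm
        lines.append(
            f"- {obj.name} is at ({x:.1f}, {y:.1f}, {z:.1f}) mm and is {obj.state}."
        )
    return "\n".join(lines)


# Hypothetical scene; the values are placeholders, not data from the paper.
scene = [
    TwinObject("red block", (12.0, -4.5, 80.0), "on peg 3"),
    TwinObject("gauze", (30.0, 10.0, 78.0), "lying on the tissue phantom"),
]
print(describe_scene(scene))
```

Sorting the objects keeps the description deterministic, which makes it easier to compare planner behaviour across repeated runs of the same scene.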
If this is right
- The system performs peg transfer and gauze retrieval tasks with strong results.
- Performance holds across varied environmental settings.
- The approach provides greater flexibility than prior simple perception solutions for scaling to less constrained conditions.
- Integration of the digital twin representation with an LLM agent produces an embodied intelligence system on the dVRK platform.
Where Pith is reading between the lines
- The same digital twin construction could support planning for additional surgical tasks once the foundation models are shown to handle more complex scenes.
- A fuller digital twin framework might improve the ability to interpret why the LLM selects particular action sequences.
- Wider use of foundation-model digital twins could reduce the need for task-specific perception code in other robotic automation domains.
Load-bearing premise
Vision foundation models can supply natural language scene representations that are detailed and accurate enough for LLM agents to plan reliably in less constrained surgical environments.
What would settle it
A controlled test in which the digital twin representation produces incomplete or inaccurate scene descriptions, causing the LLM planner to generate incorrect action sequences during peg transfer or gauze retrieval under changed lighting, object arrangements, or instrument variations.
Original abstract
Large language model-based (LLM) agents are emerging as a powerful enabler of robust embodied intelligence due to their capability of planning complex action sequences. Sound planning ability is necessary for robust automation in many task domains, but especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, developing similarly powerful and robust perception algorithms is necessary to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM-based task planning while adopting simple yet severely limited perception solutions to meet the needs for bench-top experiments, but lacks the critical flexibility to scale to less constrained settings. In this work, we propose an alternate perception approach -- a digital twin (DT)-based machine perception approach that capitalizes on the convincing performance and out-of-the-box generalization of recent vision foundation models. Integrating our DT representation and LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizability to varied environmental settings. Despite a convincing performance, this work is merely a first step towards the integration of DT representations. Future studies are necessary for the realization of a comprehensive DT framework to improve the interpretability and generalizability of embodied intelligence in surgery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using digital twin (DT) representations generated by vision foundation models to supply detailed natural-language scene descriptions for LLM-based agents in surgical task planning. It integrates this perception module with an LLM planner and the dVRK platform to create an embodied system, then evaluates robustness on peg-transfer and gauze-retrieval tasks, claiming strong performance and generalizability to varied settings while describing the work as an initial step toward a fuller DT framework.
Significance. If the empirical claims are substantiated, the work would demonstrate a practical route to more flexible perception for LLM agents in robotic surgery, moving beyond the simple, constrained perception pipelines used in prior bench-top studies. The explicit framing as a first integration of foundation-model DTs with dVRK planning is a modest but concrete contribution to embodied surgical intelligence.
major comments (2)
- [Abstract] Abstract (and, by the reader's report, the results section): the central claim of 'strong task performance and generalizability' is asserted without any quantitative metrics, success rates, error bars, dataset sizes, or exclusion criteria. This absence prevents verification of the robustness and generalizability assertions that are load-bearing for the paper's contribution.
- [Abstract] The weakest assumption identified by the reader—that vision foundation models already supply sufficiently detailed and accurate natural-language scene representations for reliable LLM planning in less-constrained surgical scenes—is stated but not tested with any ablation or failure-case analysis in the supplied abstract.
minor comments (1)
- [Abstract] The abstract contains a minor grammatical issue: 'developing similarly powerful and robust perception algorithms is necessary' should read 'developing ... algorithms are necessary' for subject-verb agreement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on strengthening the empirical substantiation in our abstract and results. We address the major comments point-by-point below and will revise the manuscript to improve clarity and verifiability of claims.
- Referee: [Abstract] Abstract (and, by the reader's report, the results section): the central claim of 'strong task performance and generalizability' is asserted without any quantitative metrics, success rates, error bars, dataset sizes, or exclusion criteria. This absence prevents verification of the robustness and generalizability assertions that are load-bearing for the paper's contribution.
  Authors: We agree that the abstract (and potentially the results presentation) would be strengthened by including explicit quantitative metrics. The manuscript describes evaluations on peg transfer and gauze retrieval using the dVRK, but we will revise the abstract to report key figures such as success rates across trials, number of environmental variations tested, and any statistical details. We will also ensure the results section explicitly lists trial counts, exclusion criteria if any, and error analysis to allow verification of the generalizability claims.
  Revision: yes
- Referee: [Abstract] The weakest assumption identified by the reader, that vision foundation models already supply sufficiently detailed and accurate natural-language scene representations for reliable LLM planning in less-constrained surgical scenes, is stated but not tested with any ablation or failure-case analysis in the supplied abstract.
  Authors: The abstract summarizes the DT perception approach without detailing ablations. The full evaluation on the two tasks with the integrated LLM planner provides implicit evidence of the perception module's utility in enabling planning. We will revise the abstract to qualify the assumption more explicitly and add a brief failure-case discussion or reference to robustness testing in the results section. A dedicated ablation study on the vision foundation model component alone is not present and would require additional experiments beyond the current scope.
  Revision: partial
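One way to report the success rates discussed in the responses above, with an error bar suited to small trial counts, is a Wilson score interval per environmental setting. The sketch below is only an illustration: the function name, the setting labels, and the trial counts are placeholders invented for the example and are not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder counts per hypothetical setting: (successes, trials).
# These are NOT numbers from the paper.
placeholder_results = {
    "baseline": (10, 10),
    "changed lighting": (9, 10),
    "rearranged objects": (8, 10),
}

for setting, (k, n) in placeholder_results.items():
    low, high = wilson_interval(k, n)
    print(f"{setting}: {k}/{n} successes, interval [{low:.2f}, {high:.2f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and does not collapse to zero width when every trial succeeds, which matters when only a handful of trials are run per setting.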
Circularity Check
No significant circularity
The paper describes an empirical system integration of vision-foundation-model-derived digital twin scene representations with an LLM planner, evaluated on peg-transfer and gauze-retrieval tasks on the dVRK. No equations, parameter-fitting steps, uniqueness theorems, or self-citational load-bearing premises appear in the derivation chain; the central claim is that the integrated system exhibits task performance and generalizability in the reported experiments. This is a self-contained empirical demonstration rather than a reduction of any prediction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Vision foundation models deliver out-of-the-box generalization sufficient to produce detailed natural-language scene representations usable by LLM planners in surgical settings.
Forward citations
Cited by 2 Pith papers
- TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research
  TwinOR creates dynamic photorealistic digital twins of operating rooms that generate realistic RGB and depth data enabling embodied AI perception and localization tasks to match real-world performance levels.
- SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge
  SegSTRONG-C provides a new benchmark where top models reach 0.9394 DSC and 0.9301 NSD on corrupted surgical tool segmentation tests, showing conventional techniques help but calling for more innovative robustness methods.