TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research

Han Zhang, Yiqing Shen, Roger D. Soberanis-Mukul, Ankita Ghosh, Hao Ding, Lalithkumar Seenivasan, Jose L. Porras, Zhekai Mao, Chenjia Li, Wenjie Xiao, Lonny Yarmus, Angela Christine Argento, Masaru Ishii, Mathias Unberath

arxiv: 2511.07412 · v2 · submitted 2025-11-10 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-17 23:10 UTC · model grok-4.3

keywords digital twins · operating rooms · photorealistic simulation · embodied AI · visual localization · stereo reconstruction · surgical robotics · real-to-sim

The pith

TwinOR reconstructs operating rooms into dynamic 3D digital twins with centimeter accuracy for embodied AI training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TwinOR as a real-to-sim system that builds photorealistic digital twins of operating rooms by first capturing static geometry and then modeling ongoing human and equipment movements. These elements combine into a controllable 3D environment that produces synthetic stereo images, monocular video, and depth data. Experiments test standard models for stereo depth estimation and visual SLAM on the generated streams, finding that their accuracy stays within the ranges those models report on actual indoor datasets. This setup matters for embodied AI because it supplies a safe, repeatable space to develop and evaluate surgical agents without the regulatory and practical barriers of real operating rooms.

Core claim

TwinOR is a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs. The system reconstructs static geometry and continuously models human and equipment motion. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows.

What carries the argument

The real-to-sim pipeline that reconstructs static OR geometry at centimeter accuracy and models continuous human and equipment motion, then fuses both to generate photorealistic sensor streams for perception and localization tasks.

If this is right

Perception models such as FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data that falls inside their published ranges on real indoor scenes.
Embodied AI systems for surgery can be trained and tested in a risk-free, fully controllable digital environment.
The pipeline supports automatic creation of multiple dynamic OR twins from real-world captures.
Synthetic sensor streams from TwinOR can substitute for real data in visual localization and geometry-understanding benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the sensor realism extends to additional modalities such as force or audio, TwinOR could support full closed-loop robotic surgery simulation.
Large collections of TwinOR instances might supply training data for medical foundation models that generalize across different hospitals and procedures.
The same reconstruction-plus-motion approach could be applied to other constrained environments such as ICU bays or interventional suites.

Load-bearing premise

Reconstructed geometry and modeled motions, when fused, produce sensor observations whose statistical properties are close enough to real OR data that downstream AI models exhibit comparable behavior.

What would settle it

Deploy a model trained exclusively on TwinOR-generated data into a real operating room and measure whether its perception or localization error exceeds the error range observed when the same model is trained and tested on real indoor datasets.
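As a rough illustration of the test described above, the sketch below compares the mean error of a TwinOR-trained model on real OR sequences against a reference error range. The function name `exceeds_reference_range`, the `(low, high)` range convention, and any numbers passed to it are hypothetical placeholders, not values or procedures from the paper.

```python
from statistics import mean


def exceeds_reference_range(real_or_errors, reference_range):
    """Compare per-sequence errors measured in a real OR against a reference range.

    real_or_errors: iterable of error values (e.g. one trajectory error per sequence).
    reference_range: (low, high) error range observed when the same model is
        trained and tested on real indoor datasets (hypothetical convention).
    Returns (mean_error, exceeds), where exceeds is True if the mean error
    lies above the upper end of the reference range.
    """
    errors = list(real_or_errors)
    if not errors:
        raise ValueError("need at least one error measurement")
    low, high = reference_range
    if low > high:
        raise ValueError("reference_range must be (low, high) with low <= high")
    mean_error = mean(errors)
    return mean_error, mean_error > high
```

A call such as `exceeds_reference_range(errors_from_real_or, (low, high))` would flag a sim-to-real gap only when the mean error is above the reference range; a stricter variant could compare per-sequence errors or use a statistical test instead of the mean.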

Original abstract

Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains an open challenge. We introduce TwinOR, a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs. The system reconstructs static geometry and continuously models human and equipment motion. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows. In our experiments, TwinOR synthesizes stereo and monocular RGB streams as well as depth observations for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 evaluated on TwinOR-synthesized data achieve performance within their reported accuracy ranges on real-world indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for emulating real-world perception and localization challenge. By establishing a perception-grounded real-to-sim pipeline, TwinOR enables the automatic construction of dynamic, photorealistic digital twins of ORs. As a safe and scalable environment for experimentation, TwinOR opens new opportunities for translating embodied intelligence from simulation to real-world clinical environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TwinOR, a real-to-sim infrastructure for photorealistic dynamic digital twins of operating rooms. It reconstructs static OR geometry at centimeter-level accuracy, continuously models human and equipment motions, and fuses these into an immersive 3D environment supporting controllable RGB, stereo, and depth simulation. Experiments evaluate perception models (FoundationStereo, ORB-SLAM3) on the synthesized streams and report that their accuracy falls within previously published ranges for real indoor datasets such as TUM and EuRoC, from which the authors conclude that TwinOR achieves sensor-level realism sufficient to emulate real-world perception and localization challenges.

Significance. If the realism claim holds, TwinOR would provide a valuable, safe, and scalable testbed for embodied AI in surgical settings, addressing regulatory barriers to real OR experimentation. The fusion of static reconstruction with dynamic motion modeling and the use of downstream task performance as a proxy for fidelity are practical strengths that could accelerate translation from simulation to clinical environments.

major comments (2)

[Abstract / Experiments] The central claim that TwinOR supplies 'sensor-level realism sufficient for emulating real-world perception and localization challenge' rests on indirect evidence that FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data within ranges reported for general indoor datasets. This does not directly verify that the fused geometry and motion models reproduce OR-specific sensor statistics (specular surfaces, surgical-lamp lighting changes, equipment occlusions, or staff motion artifacts). Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures is required to substantiate the claim.
[Abstract] The statement that 'synthesized data yields model performance within published real-world ranges' is presented without accompanying quantitative error metrics, ablation studies on dynamic components, or explicit validation of motion models against ground-truth trajectories. This omission leaves the evidence for photorealism and dynamic fidelity incomplete.

minor comments (2)

[Abstract] Specify the exact performance numbers and real-world dataset references (TUM, EuRoC, etc.) used for the 'within published ranges' comparison so readers can assess the tightness of the match.
Clarify the measurement protocol and error statistics that support the 'centimeter-level accuracy' claim for static geometry reconstruction.
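One way the requested error statistics could be reported is a cloud-to-cloud distance summary between the reconstructed geometry and a reference scan. The sketch below is an assumption about what such a protocol might look like, not the paper's method: the function `cloud_to_cloud_error`, the brute-force nearest-neighbour search, and the choice of statistics are illustrative, and it presumes both point clouds are already registered in the same metric frame.

```python
import numpy as np


def cloud_to_cloud_error(recon, reference):
    """Summarise nearest-neighbour distances from reconstructed points to a reference cloud.

    recon: (N, 3) array of reconstructed points.
    reference: (M, 3) array of reference points in the same frame and units.
    Returns a dict of distance statistics in the input units.
    """
    recon = np.asarray(recon, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Brute-force (N, M) pairwise distances; adequate for small clouds only.
    diff = recon[:, None, :] - reference[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return {
        "mean": float(nearest.mean()),
        "median": float(np.median(nearest)),
        "p95": float(np.percentile(nearest, 95)),
        "max": float(nearest.max()),
    }
```

Reporting the median and a high percentile alongside the mean makes it easier to judge whether a "centimeter-level" figure describes typical error or only the best-covered surfaces. For full-size scans a spatial index (for example a k-d tree) would replace the brute-force search.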

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our evaluation methodology and commit to revisions that strengthen the presentation of results while acknowledging practical constraints in OR data collection.

Referee: [Abstract / Experiments] The central claim that TwinOR supplies 'sensor-level realism sufficient for emulating real-world perception and localization challenge' rests on indirect evidence that FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data within ranges reported for general indoor datasets. This does not directly verify that the fused geometry and motion models reproduce OR-specific sensor statistics (specular surfaces, surgical-lamp lighting changes, equipment occlusions, or staff motion artifacts). Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures is required to substantiate the claim.

Authors: We agree that direct sensor-level comparisons would offer stronger substantiation. Our evaluation follows standard practice in sim-to-real research by using downstream task performance (stereo matching and visual SLAM) as a proxy for overall fidelity, which captures the combined effects of geometry, lighting, and dynamics on perception. In the revision we will add a dedicated limitations paragraph discussing this indirect approach, expand error distribution analysis from the reported experiments, and provide more detail on how the reconstruction pipeline models OR-specific elements such as specular surfaces and dynamic lighting changes.

revision: partial

Referee: [Abstract] The statement that 'synthesized data yields model performance within published real-world ranges' is presented without accompanying quantitative error metrics, ablation studies on dynamic components, or explicit validation of motion models against ground-truth trajectories. This omission leaves the evidence for photorealism and dynamic fidelity incomplete.

Authors: We will revise the abstract and experiments section to include explicit quantitative metrics (e.g., absolute trajectory error for ORB-SLAM3 and disparity errors for FoundationStereo) with direct numerical comparisons to the cited real-world ranges. We will also add ablation studies that isolate the contribution of the dynamic motion models and clarify the validation of those models against the ground-truth trajectories obtained during the real-to-sim reconstruction process.

revision: yes
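The two metrics named in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `align_rigid`, `ate_rmse`, and `disparity_epe` are hypothetical helper names, trajectories are assumed to be time-associated (N, 3) position arrays, and the alignment is rigid without scale, which suits stereo or RGB-D runs (a monocular run would additionally need a scale estimate).

```python
import numpy as np


def align_rigid(est, gt):
    """Rigidly align est onto gt (rotation + translation, no scale) via SVD (Kabsch)."""
    est_mean, gt_mean = est.mean(axis=0), gt.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (est - est_mean).T @ (gt - gt_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = gt_mean - R @ est_mean
    return (R @ est.T).T + t


def ate_rmse(est, gt):
    """Absolute trajectory error: RMSE of position residuals after rigid alignment."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    residuals = np.linalg.norm(align_rigid(est, gt) - gt, axis=1)
    return float(np.sqrt(np.mean(residuals ** 2)))


def disparity_epe(pred, gt, valid=None):
    """Mean absolute disparity error in pixels over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if valid is None:
        valid = np.isfinite(gt)  # treat non-finite ground truth as invalid
    return float(np.mean(np.abs(pred[valid] - gt[valid])))
```

Because the simulator supplies ground-truth camera poses and depth, both metrics can be computed per sequence and then set beside the ranges reported on real indoor benchmarks.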

standing simulated objections not resolved

Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures

Circularity Check

0 steps flagged

No significant circularity; validation uses external benchmarks

The paper describes a reconstruction pipeline that builds static OR geometry to centimeter accuracy, models continuous human/equipment motion, and fuses both into a controllable 3D simulation environment. The central claim of sensor-level realism is supported by running FoundationStereo and ORB-SLAM3 on the synthesized RGB/depth streams and noting that their accuracy falls inside previously published ranges for real indoor datasets (TUM, EuRoC). This comparison draws on independent external results rather than on any internal fit, self-definition, or self-citation that would make the outcome equivalent to its inputs by construction. No load-bearing step in the abstract or described workflow reduces to a renaming, ansatz smuggling, or uniqueness theorem imported from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the system appears to rely on standard computer-vision reconstruction and tracking techniques without introducing new physical entities.

pith-pipeline@v0.9.0 · 5633 in / 1093 out tokens · 54289 ms · 2026-05-17T23:10:29.632770+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: TwinOR reconstructs static geometry... continuously models human and equipment motion... fused into an immersive 3D environment... FoundationStereo and ORB-SLAM3... within their reported accuracy ranges on real-world indoor datasets.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: centimeter-level accuracy... SSIM 0.90/0.92... MPJPE of 3.52 cm

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting
cs.CV · 2026-04 · unverdicted · novelty 7.0

GEAR is an EM-style alternating optimization framework that jointly models geometry and motion in Gaussian Splatting to improve reconstruction of complex articulated objects.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Özsoy, E., et al.: Oracle: Large vision-language models for knowledge-guided holistic or domain modeling. arXiv preprint arXiv:2404.07031 (2024). https://doi.org/10.48550/arXiv.2404.07031
[2] Killeen, B.D., et al.: FluoroSAM: A Language-Promptable Foundation Model for Flexible X-Ray Image Segmentation. In: Medical Image Computing and Computer Assisted Intervention MICCAI 2025
[3] Zhang, H., et al.: StraightTrack: Towards mixed reality navigation system for percutaneous K-wire insertion. Healthcare Technology Letters 11, 355–364 (2024). https://doi.org/10.1049/htl2.12103
[4] Zhang, H., et al.: Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors. arXiv (2025). https://doi.org/10.48550/arXiv.2510.04802
[5] Liu, C., et al.: Vision language model-enhanced embodied intelligence for digital twin-assisted human-robot collaborative assembly. Journal of Industrial Information Integration, 100943 (2025)
[6] Kumar, S.N., et al.: Health Care Industry Use Cases of Embodied AI. In: Building Embodied AI Systems: The Agents, the Architecture Principles, Challenges, and Application Domains, pp. 223–239. Springer, Cham (2024). https://doi.org/10.1007/978-3-031-68256-8_10
[7] Tagliabue, E., et al.: Soft tissue simulation environment to learn manipulation tasks in autonomous robotic surgery. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (2020)
[8] Richter, F., et al.: Open-sourced reinforcement learning environments for surgical robotics. arXiv preprint arXiv:1903.02090 (2019). https://doi.org/10.48550/arXiv.1903.02090
[9] Ding, H., et al.: Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding. Artificial Intelligence Surgery 4(3) (2024)
[10] Oo, K.H., et al.: Digital twin-enabled multi-robot system for collaborative assembly of unorganized parts. Journal of Industrial Information Integration 44, 100764 (2025). https://doi.org/10.1016/j.jii.2024.100764
[11] Liu, Y., et al.: dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale. arXiv (2025). https://doi.org/10.48550/arXiv.2503.05646. http://arxiv.org/abs/2503.05646
[12] Killeen, B.D., et al.: In silico simulation: a key enabling technology for next-generation intelligent surgical systems. Progress in Biomedical Engineering 5(3), 032001 (2023). https://doi.org/10.1088/2516-1091/acd28b
[13] Munawar, A., et al.: Virtual reality for synergistic surgical training and data generation. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 366–374 (2022). https://doi.org/10.1080/21681163.2021.1999331
[14] Killeen, B.D., et al.: Stand in surgeon’s shoes: virtual reality cross-training to enhance teamwork in surgery. International Journal of Computer Assisted Radiology and Surgery 19(6), 1213–1222 (2024). https://doi.org/10.1007/s11548-024-03138-7
[15] Ding, H., et al.: Towards robust automation of surgical systems via digital twin-based scene representations from foundation models. arXiv preprint arXiv:2409.13107 (2024)
[16] Ding, H., et al.: Towards robust algorithms for surgical phase recognition via digital twin representation. In: International Workshop on Digital Twin for Healthcare, pp. 119–129. Springer (2025)
[17] Perez, A., et al.: Privacy-Preserving Operating Room Workflow Analysis using Digital Twins. arXiv (2025). https://doi.org/10.48550/arXiv.2504.12552
[18] Shen, Y., et al.: Online reasoning video segmentation with just-in-time digital twins (2025). https://doi.org/10.48550/arXiv.2503.21056
[19] Hein, J., et al.: Creating a Digital Twin of Spinal Surgery: A Proof of Concept, pp. 2355–2364 (2024). https://doi.org/10.48550/arXiv.2403.16736
[20] Kleinbeck, C., et al.: Neural digital twins: reconstructing complex medical environments for spatial planning in virtual reality. Int. J. CARS 19(7) (2024)
[21] Li, Z., et al.: Neuralangelo: High-Fidelity Neural Surface Reconstruction. arXiv (2023). https://doi.org/10.48550/arXiv.2306.03092
[22] Loper, M., et al.: SMPL: A Skinned Multi-Person Linear Model, 1st edn. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3596711.3596800
[23] Xu, Y., et al.: Vitpose++: Vision transformer for generic body pose estimation. arXiv preprint arXiv:2212.04246 (2022). https://doi.org/10.48550/arXiv.2212.04246
[24] EasyMoCap - Make human motion capture easier. Github (2021). https://github.com/zju3dv/EasyMocap
[25] Ravi, N., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024). https://doi.org/10.48550/arXiv.2408.00714
[26] Rusu, R.B., et al.: Fast point feature histograms (fpfh) for 3d registration. In: 2009 IEEE International Conference on Robotics and Automation (2009). https://doi.org/10.1109/ROBOT.2009.5152473
[27] Park, J., et al.: Colored point cloud registration revisited. In: 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2017.25
[28] Schönberger, J.L., Frahm, J.-M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
[29] Schöps, T., Schönberger, et al.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
[30] Knapitsch, A., et al.: Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (SIGGRAPH) (2017). https://doi.org/10.1145/3072959.3073599
[31] Wen, B., et al.: FoundationStereo: Zero-Shot Stereo Matching (2025). https://doi.org/10.48550/arXiv.2501.09898
[32] Campos, C., et al.: ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics 37(6), 1874–1890 (2021). https://doi.org/10.48550/arXiv.2007.11898
[33] Vedadi, A., et al.: Comparative evaluation of rgb-d slam methods for humanoid robot localization and mapping, pp. 807–812 (2023). https://doi.org/10.1109/ICRoM60803.2023.10412425
