TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research
Pith reviewed 2026-05-17 23:10 UTC · model grok-4.3
The pith
TwinOR reconstructs operating rooms into dynamic 3D digital twins with centimeter accuracy for embodied AI training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TwinOR is a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs. The system reconstructs static geometry and continuously models human and equipment motion. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows.
What carries the argument
The real-to-sim pipeline that reconstructs static OR geometry at centimeter accuracy and models continuous human and equipment motion, then fuses both to generate photorealistic sensor streams for perception and localization tasks.
If this is right
- Perception models such as FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data that falls inside their published ranges on real indoor scenes.
- Embodied AI systems for surgery can be trained and tested in a risk-free, fully controllable digital environment.
- The pipeline supports automatic creation of multiple dynamic OR twins from real-world captures.
- Synthetic sensor streams from TwinOR can substitute for real data in visual localization and geometry-understanding benchmarks.
Where Pith is reading between the lines
- If the sensor realism extends to additional modalities such as force or audio, TwinOR could support full closed-loop robotic surgery simulation.
- Large collections of TwinOR instances might supply training data for medical foundation models that generalize across different hospitals and procedures.
- The same reconstruction-plus-motion approach could be applied to other constrained environments such as ICU bays or interventional suites.
Load-bearing premise
Reconstructed geometry and modeled motions, when fused, produce sensor observations whose statistical properties are close enough to real OR data that downstream AI models exhibit comparable behavior.
What would settle it
Deploy a model trained exclusively on TwinOR-generated data into a real operating room and measure whether its perception or localization error exceeds the error range observed when the same model is trained and tested on real indoor datasets.
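Once the errors are measured, the test above reduces to a simple comparison. The following is a minimal Python sketch, assuming a hypothetical `ErrorRange` type and placeholder numbers that are not results from the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRange:
    """Closed interval of errors observed on real data (hypothetical type)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def sim_to_real_verdict(real_or_error: float, reference: ErrorRange) -> str:
    """Classify the real-OR error of a model trained only on synthetic data."""
    if real_or_error > reference.high:
        return "exceeds reference range"
    if real_or_error < reference.low:
        return "below reference range"
    return "within reference range"


# Placeholder values for illustration only; not taken from the paper.
reference = ErrorRange(low=0.01, high=0.05)
print(sim_to_real_verdict(0.08, reference))
```

An error above the upper bound would count against the load-bearing premise. An error inside the range would be consistent with it but would not prove it.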
Original abstract
Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains an open challenge. We introduce TwinOR, a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs. The system reconstructs static geometry and continuously models human and equipment motion. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows. In our experiments, TwinOR synthesizes stereo and monocular RGB streams as well as depth observations for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 evaluated on TwinOR-synthesized data achieve performance within their reported accuracy ranges on real-world indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for emulating real-world perception and localization challenge. By establishing a perception-grounded real-to-sim pipeline, TwinOR enables the automatic construction of dynamic, photorealistic digital twins of ORs. As a safe and scalable environment for experimentation, TwinOR opens new opportunities for translating embodied intelligence from simulation to real-world clinical environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TwinOR, a real-to-sim infrastructure for photorealistic dynamic digital twins of operating rooms. It reconstructs static OR geometry at centimeter-level accuracy, continuously models human and equipment motions, and fuses these into an immersive 3D environment supporting controllable RGB, stereo, and depth simulation. Experiments evaluate perception models (FoundationStereo, ORB-SLAM3) on the synthesized streams and report that their accuracy falls within previously published ranges for real indoor datasets such as TUM and EuRoC, from which the authors conclude that TwinOR achieves sensor-level realism sufficient to emulate real-world perception and localization challenges.
Significance. If the realism claim holds, TwinOR would provide a valuable, safe, and scalable testbed for embodied AI in surgical settings, addressing regulatory barriers to real OR experimentation. The fusion of static reconstruction with dynamic motion modeling and the use of downstream task performance as a proxy for fidelity are practical strengths that could accelerate translation from simulation to clinical environments.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments: The central claim that TwinOR supplies 'sensor-level realism sufficient for emulating real-world perception and localization challenge' rests on indirect evidence that FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data within ranges reported for general indoor datasets. This does not directly verify that the fused geometry and motion models reproduce OR-specific sensor statistics (specular surfaces, surgical-lamp lighting changes, equipment occlusions, or staff motion artifacts). Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures is required to substantiate the claim.
- [Abstract] Abstract: The statement that 'synthesized data yields model performance within published real-world ranges' is presented without accompanying quantitative error metrics, ablation studies on dynamic components, or explicit validation of motion models against ground-truth trajectories. This omission leaves the evidence for photorealism and dynamic fidelity incomplete.
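One way to make the direct comparison requested in the first major comment concrete is a two-sample test on a scalar sensor statistic, for example per-pixel depth residuals drawn from real and synthesized captures. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python. The choice of statistic and the function name `ks_statistic` are assumptions for illustration; neither the paper nor the referee specifies them.

```python
from typing import Sequence


def ks_statistic(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value <= x in both samples (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap
```

A small statistic on held-out real OR captures would support the realism claim more directly than downstream task accuracy alone. The threshold for "small" would still need to be justified separately.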
minor comments (2)
- [Abstract] Specify the exact performance numbers and real-world dataset references (TUM, EuRoC, etc.) used for the 'within published ranges' comparison so readers can assess the tightness of the match.
- Clarify the measurement protocol and error statistics that support the 'centimeter-level accuracy' claim for static geometry reconstruction.
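The text reviewed here does not state the protocol behind the centimeter-level figure. One common choice, shown below as a sketch and not as the authors' method, is the nearest-neighbor distance from each reconstructed point to a reference scan. The function names and the 1 cm threshold are assumptions, and the brute-force search is only suitable for small point sets.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distances(source: Sequence[Point], target: Sequence[Point]) -> List[float]:
    """Distance from each source point to its closest target point (brute force)."""
    return [min(math.dist(p, q) for q in target) for p in source]


def accuracy_stats(
    reconstructed: Sequence[Point],
    reference: Sequence[Point],
    threshold_m: float = 0.01,  # assumed 1 cm tolerance, in meters
) -> Dict[str, float]:
    """Summarize reconstruction error against a reference scan, in meters."""
    if not reconstructed or not reference:
        raise ValueError("both point sets must be non-empty")
    d = nearest_distances(reconstructed, reference)
    n = len(d)
    return {
        "mean_m": sum(d) / n,
        "rmse_m": math.sqrt(sum(x * x for x in d) / n),
        "inlier_fraction": sum(x <= threshold_m for x in d) / n,
    }
```

Reporting the mean, the RMSE, and the inlier fraction together would let readers judge whether "centimeter-level" describes typical error or only the best-reconstructed regions.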
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our evaluation methodology and commit to revisions that strengthen the presentation of results while acknowledging practical constraints in OR data collection.
-
Referee: [Abstract / Experiments] Abstract and Experiments: The central claim that TwinOR supplies 'sensor-level realism sufficient for emulating real-world perception and localization challenge' rests on indirect evidence that FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data within ranges reported for general indoor datasets. This does not directly verify that the fused geometry and motion models reproduce OR-specific sensor statistics (specular surfaces, surgical-lamp lighting changes, equipment occlusions, or staff motion artifacts). Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures is required to substantiate the claim.
Authors: We agree that direct sensor-level comparisons would offer stronger substantiation. Our evaluation follows standard practice in sim-to-real research by using downstream task performance (stereo matching and visual SLAM) as a proxy for overall fidelity, which captures the combined effects of geometry, lighting, and dynamics on perception. In the revision we will add a dedicated limitations paragraph discussing this indirect approach, expand error distribution analysis from the reported experiments, and provide more detail on how the reconstruction pipeline models OR-specific elements such as specular surfaces and dynamic lighting changes. revision: partial
-
Referee: [Abstract] Abstract: The statement that 'synthesized data yields model performance within published real-world ranges' is presented without accompanying quantitative error metrics, ablation studies on dynamic components, or explicit validation of motion models against ground-truth trajectories. This omission leaves the evidence for photorealism and dynamic fidelity incomplete.
Authors: We will revise the abstract and experiments section to include explicit quantitative metrics (e.g., absolute trajectory error for ORB-SLAM3 and disparity errors for FoundationStereo) with direct numerical comparisons to the cited real-world ranges. We will also add ablation studies that isolate the contribution of the dynamic motion models and clarify the validation of those models against the ground-truth trajectories obtained during the real-to-sim reconstruction process. revision: yes
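The metrics named in the second response can be sketched in a few lines of plain Python. This is a simplified illustration and not the authors' evaluation code. The trajectory error below aligns only the centroids of the two trajectories and omits the rigid rotation alignment that full absolute-trajectory-error evaluations normally include. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _centroid(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def ate_rmse(estimated: Sequence[Point], ground_truth: Sequence[Point],
             align_translation: bool = True) -> float:
    """RMSE of position error over time-matched poses (translation-only alignment)."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and time-matched")
    zero = (0.0, 0.0, 0.0)
    ce = _centroid(estimated) if align_translation else zero
    cg = _centroid(ground_truth) if align_translation else zero
    sq = 0.0
    for e, g in zip(estimated, ground_truth):
        sq += sum(((e[k] - ce[k]) - (g[k] - cg[k])) ** 2 for k in range(3))
    return math.sqrt(sq / len(estimated))


def mean_epe(pred_disp: Sequence[float], gt_disp: Sequence[float]) -> float:
    """Mean absolute disparity error (end-point error) over valid pixels."""
    return sum(abs(p - g) for p, g in zip(pred_disp, gt_disp)) / len(gt_disp)


def bad_pixel_rate(pred_disp: Sequence[float], gt_disp: Sequence[float],
                   tau: float = 1.0) -> float:
    """Fraction of pixels whose disparity error exceeds tau pixels."""
    return sum(abs(p - g) > tau for p, g in zip(pred_disp, gt_disp)) / len(gt_disp)
```

Reporting these values next to the cited real-world ranges, as the authors propose, would let readers check how tight the match is.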
- Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures
Circularity Check
No significant circularity; validation uses external benchmarks
The paper describes a reconstruction pipeline that builds static OR geometry to centimeter accuracy, models continuous human/equipment motion, and fuses both into a controllable 3D simulation environment. The central claim of sensor-level realism is supported by running FoundationStereo and ORB-SLAM3 on the synthesized RGB/depth streams and noting that their accuracy falls inside previously published ranges for real indoor datasets (TUM, EuRoC). This comparison draws on independent external results rather than any internal fit, self-definition, or self-citation that would make the outcome equivalent to the construction inputs by construction. No load-bearing step in the abstract or described workflow reduces to a renaming, ansatz smuggling, or uniqueness theorem imported from the same authors.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: TwinOR reconstructs static geometry... continuously models human and equipment motion... fused into an immersive 3D environment... FoundationStereo and ORB-SLAM3... within their reported accuracy ranges on real-world indoor datasets.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: centimeter-level accuracy... SSIM 0.90/0.92... MPJPE of 3.52 cm
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting
  GEAR is an EM-style alternating optimization framework that jointly models geometry and motion in Gaussian Splatting to improve reconstruction of complex articulated objects.
Reference graph
Works this paper leans on
- [1] Özsoy, E., et al.: Oracle: Large vision-language models for knowledge-guided holistic OR domain modeling. arXiv preprint arXiv:2404.07031 (2024). https://doi.org/10.48550/arXiv.2404.07031
- [2] Killeen, B.D., et al.: FluoroSAM: A Language-Promptable Foundation Model for Flexible X-Ray Image Segmentation. In: Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
- [3] Zhang, H., et al.: StraightTrack: Towards mixed reality navigation system for percutaneous K-wire insertion. Healthcare Technology Letters 11, 355–364 (2024). https://doi.org/10.1049/htl2.12103
- [4] Zhang, H., et al.: Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors. arXiv (2025). https://doi.org/10.48550/arXiv.2510.04802
- [5] Liu, C., et al.: Vision language model-enhanced embodied intelligence for digital twin-assisted human-robot collaborative assembly. Journal of Industrial Information Integration, 100943 (2025)
- [6] Kumar, S.N., et al.: Health Care Industry Use Cases of Embodied AI. In: Building Embodied AI Systems: The Agents, the Architecture Principles, Challenges, and Application Domains, pp. 223–239. Springer, Cham (2024). https://doi.org/10.1007/978-3-031-68256-8_10
- [7] Tagliabue, E., et al.: Soft tissue simulation environment to learn manipulation tasks in autonomous robotic surgery. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2020). IEEE
- [8] Richter, F., et al.: Open-sourced reinforcement learning environments for surgical robotics. arXiv preprint arXiv:1903.02090 (2019). https://doi.org/10.48550/arXiv.1903.02090
- [9] Ding, H., et al.: Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding. Artificial Intelligence Surgery 4(3) (2024)
- [10] Oo, K.H., et al.: Digital twin-enabled multi-robot system for collaborative assembly of unorganized parts. Journal of Industrial Information Integration 44, 100764 (2025). https://doi.org/10.1016/j.jii.2024.100764
- [11] Liu, Y., et al.: dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale. arXiv (2025). https://doi.org/10.48550/arXiv.2503.05646. http://arxiv.org/abs/2503.05646
- [12] Killeen, B.D., et al.: In silico simulation: a key enabling technology for next-generation intelligent surgical systems. Progress in Biomedical Engineering 5(3), 032001 (2023). https://doi.org/10.1088/2516-1091/acd28b
- [13] Munawar, A., et al.: Virtual reality for synergistic surgical training and data generation. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 366–374 (2022). https://doi.org/10.1080/21681163.2021.1999331
- [14] Killeen, B.D., et al.: Stand in surgeon’s shoes: virtual reality cross-training to enhance teamwork in surgery. International Journal of Computer Assisted Radiology and Surgery 19(6), 1213–1222 (2024). https://doi.org/10.1007/s11548-024-03138-7
- [15] Ding, H., et al.: Towards robust automation of surgical systems via digital twin-based scene representations from foundation models. arXiv preprint arXiv:2409.13107 (2024)
- [16] Ding, H., et al.: Towards robust algorithms for surgical phase recognition via digital twin representation. In: International Workshop on Digital Twin for Healthcare, pp. 119–129 (2025). Springer
- [17] Perez, A., et al.: Privacy-Preserving Operating Room Workflow Analysis using Digital Twins. arXiv (2025). https://doi.org/10.48550/arXiv.2504.12552
- [18] Shen, Y., et al.: Online reasoning video segmentation with just-in-time digital twins (2025). https://doi.org/10.48550/arXiv.2503.21056
- [19] Hein, J., et al.: Creating a Digital Twin of Spinal Surgery: A Proof of Concept, pp. 2355–2364 (2024). https://doi.org/10.48550/arXiv.2403.16736
- [20] Kleinbeck, C., et al.: Neural digital twins: reconstructing complex medical environments for spatial planning in virtual reality. Int. J. CARS 19(7) (2024)
- [21] Li, Z., et al.: Neuralangelo: High-Fidelity Neural Surface Reconstruction. arXiv (2023). https://doi.org/10.48550/arXiv.2306.03092
- [22] Loper, M., et al.: SMPL: A Skinned Multi-Person Linear Model, 1st edn. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3596711.3596800
- [23] Xu, Y., et al.: Vitpose++: Vision transformer for generic body pose estimation. arXiv preprint arXiv:2212.04246 (2022). https://doi.org/10.48550/arXiv.2212.04246
- [24] EasyMoCap - Make human motion capture easier. GitHub (2021). https://github.com/zju3dv/EasyMocap
- [25] Ravi, N., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024). https://doi.org/10.48550/arXiv.2408.00714
- [26] Rusu, R.B., et al.: Fast point feature histograms (FPFH) for 3D registration. In: 2009 IEEE International Conference on Robotics and Automation (2009). https://doi.org/10.1109/ROBOT.2009.5152473
- [27] Park, J., et al.: Colored point cloud registration revisited. In: 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2017.25
- [28] Schönberger, J.L., Frahm, J.-M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
- [29] Schöps, T., Schönberger, et al.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
- [30] Knapitsch, A., et al.: Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (SIGGRAPH) (2017). https://doi.org/10.1145/3072959.3073599
- [31] Wen, B., et al.: FoundationStereo: Zero-Shot Stereo Matching (2025). https://doi.org/10.48550/arXiv.2501.09898
- [32] Campos, C., et al.: ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics 37(6), 1874–1890 (2021). https://doi.org/10.48550/arXiv.2007.11898
- [33] Vedadi, A., et al.: Comparative evaluation of RGB-D SLAM methods for humanoid robot localization and mapping, pp. 807–812 (2023). https://doi.org/10.1109/ICRoM60803.2023.10412425