pith. sign in

arxiv: 2511.07412 · v2 · submitted 2025-11-10 · 💻 cs.CV · cs.RO

TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research

Pith reviewed 2026-05-17 23:10 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords digital twinsoperating roomsphotorealistic simulationembodied AIvisual localizationstereo reconstructionsurgical roboticsreal-to-sim
0
0 comments X

The pith

TwinOR reconstructs operating rooms into dynamic 3D digital twins with centimeter accuracy for embodied AI training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TwinOR as a real-to-sim system that builds photorealistic digital twins of operating rooms by first capturing static geometry and then modeling ongoing human and equipment movements. These elements combine into a controllable 3D environment that produces synthetic stereo images, monocular video, and depth data. Experiments test standard models for stereo depth estimation and visual SLAM on the generated streams, finding that their accuracy stays within the ranges those models report on actual indoor datasets. This setup matters for embodied AI because it supplies a safe, repeatable space to develop and evaluate surgical agents without the regulatory and practical barriers of real operating rooms.

Core claim

TwinOR is a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs. The system reconstructs static geometry and continuously models human and equipment motion. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows.

What carries the argument

The real-to-sim pipeline that reconstructs static OR geometry at centimeter accuracy and models continuous human and equipment motion, then fuses both to generate photorealistic sensor streams for perception and localization tasks.

If this is right

  • Perception models such as FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data that falls inside their published ranges on real indoor scenes.
  • Embodied AI systems for surgery can be trained and tested in a risk-free, fully controllable digital environment.
  • The pipeline supports automatic creation of multiple dynamic OR twins from real-world captures.
  • Synthetic sensor streams from TwinOR can substitute for real data in visual localization and geometry-understanding benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the sensor realism extends to additional modalities such as force or audio, TwinOR could support full closed-loop robotic surgery simulation.
  • Large collections of TwinOR instances might supply training data for medical foundation models that generalize across different hospitals and procedures.
  • The same reconstruction-plus-motion approach could be applied to other constrained environments such as ICU bays or interventional suites.

Load-bearing premise

Reconstructed geometry and modeled motions, when fused, produce sensor observations whose statistical properties are close enough to real OR data that downstream AI models exhibit comparable behavior.

What would settle it

Deploy a model trained exclusively on TwinOR-generated data into a real operating room and measure whether its perception or localization error exceeds the error range observed when the same model is trained and tested on real indoor datasets.

read the original abstract

Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains an open challenge. We introduce TwinOR, a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs. The system reconstructs static geometry and continuously models human and equipment motion. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows. In our experiments, TwinOR synthesizes stereo and monocular RGB streams as well as depth observations for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 evaluated on TwinOR-synthesized data achieve performance within their reported accuracy ranges on real-world indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for emulating real-world perception and localization challenge. By establishing a perception-grounded real-to-sim pipeline, TwinOR enables the automatic construction of dynamic, photorealistic digital twins of ORs. As a safe and scalable environment for experimentation, TwinOR opens new opportunities for translating embodied intelligence from simulation to real-world clinical environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TwinOR, a real-to-sim infrastructure for photorealistic dynamic digital twins of operating rooms. It reconstructs static OR geometry at centimeter-level accuracy, continuously models human and equipment motions, and fuses these into an immersive 3D environment supporting controllable RGB, stereo, and depth simulation. Experiments evaluate perception models (FoundationStereo, ORB-SLAM3) on the synthesized streams and report that their accuracy falls within previously published ranges for real indoor datasets such as TUM and EuRoC, from which the authors conclude that TwinOR achieves sensor-level realism sufficient to emulate real-world perception and localization challenges.

Significance. If the realism claim holds, TwinOR would provide a valuable, safe, and scalable testbed for embodied AI in surgical settings, addressing regulatory barriers to real OR experimentation. The fusion of static reconstruction with dynamic motion modeling and the use of downstream task performance as a proxy for fidelity are practical strengths that could accelerate translation from simulation to clinical environments.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments: The central claim that TwinOR supplies 'sensor-level realism sufficient for emulating real-world perception and localization challenge' rests on indirect evidence that FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data within ranges reported for general indoor datasets. This does not directly verify that the fused geometry and motion models reproduce OR-specific sensor statistics (specular surfaces, surgical-lamp lighting changes, equipment occlusions, or staff motion artifacts). Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures is required to substantiate the claim.
  2. [Abstract] Abstract: The statement that 'synthesized data yields model performance within published real-world ranges' is presented without accompanying quantitative error metrics, ablation studies on dynamic components, or explicit validation of motion models against ground-truth trajectories. This omission leaves the evidence for photorealism and dynamic fidelity incomplete.
minor comments (2)
  1. [Abstract] Specify the exact performance numbers and real-world dataset references (TUM, EuRoC, etc.) used for the 'within published ranges' comparison so readers can assess the tightness of the match.
  2. Clarify the measurement protocol and error statistics that support the 'centimeter-level accuracy' claim for static geometry reconstruction.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our evaluation methodology and commit to revisions that strengthen the presentation of results while acknowledging practical constraints in OR data collection.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments: The central claim that TwinOR supplies 'sensor-level realism sufficient for emulating real-world perception and localization challenge' rests on indirect evidence that FoundationStereo and ORB-SLAM3 achieve accuracy on TwinOR data within ranges reported for general indoor datasets. This does not directly verify that the fused geometry and motion models reproduce OR-specific sensor statistics (specular surfaces, surgical-lamp lighting changes, equipment occlusions, or staff motion artifacts). Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures is required to substantiate the claim.

    Authors: We agree that direct sensor-level comparisons would offer stronger substantiation. Our evaluation follows standard practice in sim-to-real research by using downstream task performance (stereo matching and visual SLAM) as a proxy for overall fidelity, which captures the combined effects of geometry, lighting, and dynamics on perception. In the revision we will add a dedicated limitations paragraph discussing this indirect approach, expand error distribution analysis from the reported experiments, and provide more detail on how the reconstruction pipeline models OR-specific elements such as specular surfaces and dynamic lighting changes. revision: partial

  2. Referee: [Abstract] Abstract: The statement that 'synthesized data yields model performance within published real-world ranges' is presented without accompanying quantitative error metrics, ablation studies on dynamic components, or explicit validation of motion models against ground-truth trajectories. This omission leaves the evidence for photorealism and dynamic fidelity incomplete.

    Authors: We will revise the abstract and experiments section to include explicit quantitative metrics (e.g., absolute trajectory error for ORB-SLAM3 and disparity errors for FoundationStereo) with direct numerical comparisons to the cited real-world ranges. We will also add ablation studies that isolate the contribution of the dynamic motion models and clarify the validation of those models against the ground-truth trajectories obtained during the real-to-sim reconstruction process. revision: yes

standing simulated objections not resolved
  • Direct quantitative comparison of noise distributions, lighting, or motion artifacts against real OR captures

Circularity Check

0 steps flagged

No significant circularity; validation uses external benchmarks

full rationale

The paper describes a reconstruction pipeline that builds static OR geometry to centimeter accuracy, models continuous human/equipment motion, and fuses both into a controllable 3D simulation environment. The central claim of sensor-level realism is supported by running FoundationStereo and ORB-SLAM3 on the synthesized RGB/depth streams and noting that their accuracy falls inside previously published ranges for real indoor datasets (TUM, EuRoC). This comparison draws on independent external results rather than any internal fit, self-definition, or self-citation that would make the outcome equivalent to the construction inputs by construction. No load-bearing step in the abstract or described workflow reduces to a renaming, ansatz smuggling, or uniqueness theorem imported from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the system appears to rely on standard computer-vision reconstruction and tracking techniques without introducing new physical entities.

pith-pipeline@v0.9.0 · 5633 in / 1093 out tokens · 54289 ms · 2026-05-17T23:10:29.632770+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting

    cs.CV 2026-04 unverdicted novelty 7.0

    GEAR is an EM-style alternating optimization framework that jointly models geometry and motion in Gaussian Splatting to improve reconstruction of complex articulated objects.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    arXiv preprint arXiv:2404.07031 (2024) https://doi.org/ 10.48550/arXiv.2404.07031

    ¨Ozsoy, E.,et al.: Oracle: Large vision-language models for knowledge-guided holis- tic or domain modeling. arXiv preprint arXiv:2404.07031 (2024) https://doi.org/ 10.48550/arXiv.2404.07031

  2. [2]

    In: Medical Image Computing and Computer Assisted Intervention MICCAI 2025

    Killeen, B.D., et al.: FluoroSAM: A Language-Promptable Foundation Model for Flexible X-Ray Image Segmentation. In: Medical Image Computing and Computer Assisted Intervention MICCAI 2025

  3. [3]

    Healthcare Technology Letters11, 355–364 (2024) https://doi.org/10.1049/htl2.12103

    Zhang, H.,et al.: StraightTrack: Towards mixed reality navigation system for percutaneous K-wire insertion. Healthcare Technology Letters11, 355–364 (2024) https://doi.org/10.1049/htl2.12103

  4. [4]

    arXiv (2025)

    Zhang, H., et al.: Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors. arXiv (2025). https: //doi.org/10.48550/arXiv.2510.04802

  5. [5]

    Journal of Industrial Information Integration, 100943 (2025)

    Liu, C., et al.: Vision language model-enhanced embodied intelligence for dig- ital twin-assisted human-robot collaborative assembly. Journal of Industrial Information Integration, 100943 (2025)

  6. [6]

    In: Building Embodied AI Systems: The Agents, the Architecture Principles, Challenges, and Application Domains, pp

    Kumar, S.N.,et al.: Health Care Industry Use Cases of Embodied AI. In: Building Embodied AI Systems: The Agents, the Architecture Principles, Challenges, and Application Domains, pp. 223–239. Springer, Cham (2024). https://doi.org/10. 1007/978-3-031-68256-8 10

  7. [7]

    In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2020)

    Tagliabue, E.,et al.: Soft tissue simulation environment to learn manipula- tion tasks in autonomous robotic surgery. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2020). IEEE

  8. [8]

    arXiv preprint arXiv:1903.02090 (2019) https://doi.org/10.48550/ arXiv.1903.02090 10

    Richter, F.,et al.: Open-sourced reinforcement learning environments for surgi- cal robotics. arXiv preprint arXiv:1903.02090 (2019) https://doi.org/10.48550/ arXiv.1903.02090 10

  9. [9]

    Artificial Intelligence Surgery 4(3) (2024)

    Ding, H., et al.: Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding. Artificial Intelligence Surgery 4(3) (2024)

  10. [10]

    Journal of Industrial Information Integration44, 100764 (2025) https://doi.org/10.1016/j.jii.2024.100764

    Oo, K.H.,et al.: Digital twin-enabled multi-robot system for collaborative assem- bly of unorganized parts. Journal of Industrial Information Integration44, 100764 (2025) https://doi.org/10.1016/j.jii.2024.100764

  11. [11]

    arXiv (2025)

    Liu, Y., et al.: dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale. arXiv (2025). https://doi.org/10.48550/arXiv.2503.05646 . http://arxiv.org/abs/2503.05646

  12. [12]

    Progress in Biomedical Engineering5(3), 032001 (2023) https://doi.org/10.1088/2516-1091/acd28b

    Killeen, B.D.,et al.: In silico simulation: a key enabling technology for next- generation intelligent surgical systems. Progress in Biomedical Engineering5(3), 032001 (2023) https://doi.org/10.1088/2516-1091/acd28b

  13. [13]

    Computer Methods in Biomechanics and Biomedical Engineering: Imag- ing & Visualization10(4), 366–374 (2022) https://doi.org/10.1080/21681163

    Munawar, A.,et al.: Virtual reality for synergistic surgical training and data gen- eration. Computer Methods in Biomechanics and Biomedical Engineering: Imag- ing & Visualization10(4), 366–374 (2022) https://doi.org/10.1080/21681163. 2021.1999331

  14. [14]

    International Journal of Computer Assisted Radiology and Surgery19(6), 1213–1222 (2024) https://doi.org/10.1007/ s11548-024-03138-7

    Killeen, B.D.,et al.: Stand in surgeon’s shoes: virtual reality cross-training to enhance teamwork in surgery. International Journal of Computer Assisted Radiology and Surgery19(6), 1213–1222 (2024) https://doi.org/10.1007/ s11548-024-03138-7

  15. [15]

    Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models

    Ding, H., et al.: Towards robust automation of surgical systems via digi- tal twin-based scene representations from foundation models. arXiv preprint arXiv:2409.13107 (2024)

  16. [16]

    In: International Workshop on Digital Twin for Healthcare, pp

    Ding, H.,et al.: Towards robust algorithms for surgical phase recognition via digital twin representation. In: International Workshop on Digital Twin for Healthcare, pp. 119–129 (2025). Springer

  17. [17]

    arXiv (2025)

    Perez, A., et al.: Privacy-Preserving Operating Room Workflow Analysis using Digital Twins. arXiv (2025). https://doi.org/10.48550/arXiv.2504.12552

  18. [18]

    Shen, Y., et al.: Online reasoning video segmentation with just-in-time digital twins (2025) https://doi.org/10.48550/arXiv.2503.21056

  19. [19]

    2355–2364 (2024)

    Hein, J.,et al.: Creating a Digital Twin of Spinal Surgery: A Proof of Concept, pp. 2355–2364 (2024). https://doi.org/10.48550/arXiv.2403.16736

  20. [20]

    Kleinbeck, C., et al.: Neural digital twins: reconstructing complex medical environments for spatial planning in virtual reality. Int. J. CARS19(7) (2024)

  21. [21]

    Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin

    Li, Z., et al.: Neuralangelo: High-Fidelity Neural Surface Reconstruction. arXiv 11 (2023). https://doi.org/10.48550/arXiv.2306.03092

  22. [22]

    Asso- ciation for Computing Machinery, New York, NY, USA (2023)

    Loper, M.,et al.: SMPL: A Skinned Multi-Person Linear Model, 1st edn. Asso- ciation for Computing Machinery, New York, NY, USA (2023). https://doi.org/ 10.1145/3596711.3596800

  23. [23]

    Dickerson

    Xu, Y.,et al.: Vitpose++: Vision transformer for generic body pose estima- tion. arXiv preprint arXiv:2212.04246 (2022) https://doi.org/10.48550/arXiv. 2212.04246

  24. [24]

    Github (2021)

    EasyMoCap - Make human motion capture easier. Github (2021). https://github. com/zju3dv/EasyMocap

  25. [25]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N.,et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) https://doi.org/10.48550/arXiv.2408.00714

  26. [26]

    In: 2009 IEEE International Conference on Robotics and Automation (2009)

    Rusu, R.B.,et al.: Fast point feature histograms (fpfh) for 3d registration. In: 2009 IEEE International Conference on Robotics and Automation (2009). https: //doi.org/10.1109/ROBOT.2009.5152473

  27. [27]

    wb ≡1 recovers the uniform variant

    Park, J., et al.: Colored point cloud registration revisited. In: 2017 IEEE Interna- tional Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV. 2017.25

  28. [28]

    In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

    Sch¨ onberger, J.L., Frahm, J.-M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  29. [29]

    In: Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR) (2017)

    Sch¨ ops, T., Sch¨ onberger,et al.: A multi-view stereo benchmark with high- resolution images and multi-camera videos. In: Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR) (2017)

  30. [30]

    ACM Transactions on Graphics (SIGGRAPH) (2017) https://doi.org/ 10.1145/3072959.3073599

    Knapitsch, A.,et al.: Tanks and temples: Benchmarking large-scale scene recon- struction. ACM Transactions on Graphics (SIGGRAPH) (2017) https://doi.org/ 10.1145/3072959.3073599

  31. [31]

    Foundationstereo: Zero-shot stereo matching

    Wen, B., et al.: FoundationStereo: Zero-Shot Stereo Matching (2025). https:// doi.org/10.48550/arXiv.2501.09898

  32. [32]

    IEEE Transactions on Robotics37(6), 1874–1890 (2021) https://doi.org/10.48550/arXiv.2007.11898

    Campos, C.,et al.: ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics37(6), 1874–1890 (2021) https://doi.org/10.48550/arXiv.2007.11898

  33. [33]

    A New H I Survey of Active Galaxies

    Vedadi, A.,et al.: Comparative evaluation of rgb-d slam methods for humanoid robot localization and mapping, pp. 807–812 (2023). https://doi.org/10.1109/ ICRoM60803.2023.10412425 12