SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

Akwele Johnson; Anirban Mukhopadhyay; Anirudh Dhingra; Ghazal Ghazaei; Ssharvien Kumar Sivakumar; Yannik Frisch

arxiv: 2605.16530 · v2 · pith:HAFK2V36new · submitted 2026-05-15 · 💻 cs.CV

Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay

Pith reviewed 2026-05-21 07:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords neuro-symbolic world model · cataract surgery simulation · diffusion model · sim-to-real translation · scene graph · surgical phase detection · video style transfer · rule-based simulator

The pith

SWoMo separates rule-based motion from diffusion visuals to build better cataract surgery simulators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a world model called SWoMo that splits the task of simulating cataract surgery into two parts. A symbolic rule-based simulator with scene graphs generates the physical motions and tool-tissue interactions, while a diffusion model handles realistic textures, deformations, and appearances. To train the visual model, the authors reconstruct real videos inside the simulator to create paired data for sim-to-real translation. They report that this design generalizes to new interaction geometries, improves downstream phase detection, and supports unsupervised video style transfer. Readers would care because accurate simulations help train surgeons and develop autonomous surgical systems.

Core claim

We introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation.

What carries the argument

Neuro-symbolic decoupling: a rule-based simulator with scene graphs for motion and interactions paired with a video diffusion model for appearance, trained via inverse pairing of real videos to simulated scenes.
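The symbolic half of this decoupling can be illustrated with a minimal sketch. It is not the authors' simulator: the `Node` and `SceneGraph` classes, the circular geometry, the `"touches"` relation, and the `step` function are all hypothetical stand-ins, chosen only to show how an explicit rule can update motion and re-derive tool-tissue contacts without any learned component.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical scene-graph node: a circle standing in for a tool or tissue region.
    name: str
    x: float
    y: float
    radius: float


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: set = field(default_factory=set)    # (tool, relation, other) triples


def step(graph, tool, dx, dy):
    """Move one tool, then recompute its contact edges with an explicit rule."""
    t = graph.nodes[tool]
    t.x += dx
    t.y += dy
    # Drop the tool's previous relations before re-deriving them.
    graph.edges = {e for e in graph.edges if e[0] != tool}
    for name, n in graph.nodes.items():
        if name == tool:
            continue
        dist = ((t.x - n.x) ** 2 + (t.y - n.y) ** 2) ** 0.5
        # Assumed rule: two circles are in contact when they overlap.
        if dist <= t.radius + n.radius:
            graph.edges.add((tool, "touches", name))
    return graph


graph = SceneGraph()
graph.nodes["pupil"] = Node("pupil", 0.0, 0.0, 1.0)
graph.nodes["knife"] = Node("knife", 3.0, 0.0, 0.5)
step(graph, "knife", -2.0, 0.0)
print(sorted(graph.edges))
```

Because the contact rule is written out rather than learned, it applies unchanged to positions that never appeared in any training video, which is the intuition behind the generalization claim. In the paper's design, a diffusion model would then turn the resulting state into realistic frames; that part is not sketched here.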

If this is right

The simulator produces valid motions and interactions for geometries never seen in training.
Features extracted from the simulator improve accuracy on downstream surgical phase detection tasks.
The trained diffusion model enables unsupervised transfer of visual style between different surgical video domains.
Both visual quality and interaction fidelity exceed those of prior non-neuro-symbolic simulators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same split between symbolic dynamics and learned appearance could shorten development time for simulators of other procedures, such as endoscopic ones.
Because motion rules stay explicit, the system might need far less paired video data than end-to-end learned world models.
Real-time surgical planning agents could query the symbolic component directly for physically safe action proposals.

Load-bearing premise

The inverse pairing strategy can accurately reconstruct real surgical videos inside the rule-based simulator to produce high-quality paired simulated and real data suitable for training the diffusion model without introducing reconstruction errors or biases.
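The direction of the pairing can be made concrete with a toy sketch. Everything in it is an assumption for illustration: frames are 1-D lists of intensities, `extract_state` is a hypothetical perception step, `simulate` is a hypothetical renderer, and the per-pair residual is an added diagnostic rather than something the paper is known to report.

```python
def extract_state(real_frame):
    """Hypothetical perception step: tool position is the index of the brightest pixel."""
    return max(range(len(real_frame)), key=lambda i: real_frame[i])


def simulate(state, width):
    """Hypothetical simulator render: a clean one-hot frame at the tool position."""
    return [1.0 if i == state else 0.0 for i in range(width)]


def inverse_pair(real_frames):
    """Reconstruct each real frame in the simulator and keep (sim, real, residual)."""
    pairs = []
    for real in real_frames:
        sim = simulate(extract_state(real), len(real))
        # Mean absolute difference: a crude proxy for reconstruction error.
        residual = sum(abs(s - r) for s, r in zip(sim, real)) / len(real)
        pairs.append((sim, real, residual))
    return pairs


real_video = [[0.0, 1.0, 0.0], [0.0, 0.25, 0.75]]
for sim, real, residual in inverse_pair(real_video):
    print(sim, real, round(residual, 3))
```

The pairs are built real-to-sim, while the diffusion model is trained in the opposite, sim-to-real direction. A residual of this kind is one way the premise above could be checked: large values would indicate that the simulated half of a pair does not match its real counterpart.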

What would settle it

Run the simulator on interaction geometries absent from training videos and check whether tool-tissue contact points and deformation patterns match real surgery footage within measurement error.

Figures

Figures reproduced from arXiv: 2605.16530 by Akwele Johnson, Anirban Mukhopadhyay, Anirudh Dhingra, Ghazal Ghazaei, Ssharvien Kumar Sivakumar, Yannik Frisch.

**Figure 1.** Neuro-symbolic World Model for interactive cataract surgery simulation that decouples surgical interaction dynamics from visual appearance.

**Figure 2.** Overview of SWoMo’s two-stage video diffusion training. In the second stage, we enable conditioning on the simulated sequence $\bar{x}_{1:n}$. For that, we freeze the parameters $\theta$ of the pre-trained diffusion backbone $\epsilon_\theta$ from the previous stage and create a separate, trainable copy of its encoder with parameters $\theta_c$. The frozen backbone $\epsilon_\theta$ and the trainable encoder are connected through zero-initialised convolutional…
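The role of the zero-initialised connection in the Figure 2 caption can be shown with a small pure-Python sketch. It assumes a 1x1 convolution and uses toy stand-ins (`frozen_backbone`, `control_encoder`) in place of the real networks; the pattern resembles the conditional-control design of [35], but nothing here is taken from the SWoMo code.

```python
def conv1x1(pixels, weight, bias):
    """1x1 convolution: apply the same linear map to every pixel's channel vector."""
    return [
        [sum(w * c for w, c in zip(row, px)) + b for row, b in zip(weight, bias)]
        for px in pixels
    ]


def zero_conv(channels):
    """Weights and bias all start at zero, so the layer initially outputs zeros."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels


def frozen_backbone(noisy_latent):
    # Stand-in for the frozen pre-trained backbone: a fixed, non-trainable map.
    return [[2.0 * c + 1.0 for c in px] for px in noisy_latent]


def control_encoder(sim_frame, scale=0.5):
    # Stand-in for the trainable encoder copy that reads the simulated frame.
    return [[scale * c for c in px] for px in sim_frame]


def conditioned_features(noisy_latent, sim_frame, zc):
    """Backbone features plus the control branch routed through the zero conv."""
    base = frozen_backbone(noisy_latent)
    ctrl = conv1x1(control_encoder(sim_frame), *zc)
    return [[b + c for b, c in zip(bp, cp)] for bp, cp in zip(base, ctrl)]


latent = [[1.0, 2.0], [3.0, 4.0]]
sim = [[5.0, 6.0], [7.0, 8.0]]
zc = zero_conv(2)
print(conditioned_features(latent, sim, zc) == frozen_backbone(latent))
```

At initialisation the control branch contributes nothing, so the conditioned model reproduces the frozen backbone's behaviour and the simulated sequence only gains influence as the zero-initialised weights are trained away from zero.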

**Figure 3.** Qualitative Results of Sim-to-Real Video Transfer.

**Figure 4.** Unsupervised Video Style Transfer and Generalisation to Novel Tool Motion. Downstream Evaluation on Phase Recognition: We use synthesised videos to augment the training data of a downstream model for phase recognition during cataract surgery. The phase recognition is performed using MS-TCN++ [22] trained on DINO features [5]. To generate a synthesised video, two real videos are randomly selected from the tr…

Abstract

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SWoMo splits rule-based scene-graph motion from diffusion visuals for cataract surgery and uses inverse pairing on real videos, but reconstruction accuracy remains the untested hinge.

The main point is that this paper builds a world model for cataract surgery simulation by keeping motion and interactions in a symbolic rule-based simulator with scene graphs while handing visual realism to a diffusion model. They train the diffusion part with an inverse pairing trick that reconstructs real surgical videos inside the simulator to create matched sim-real pairs for sim-to-real translation. Code, data, and weights are released, which is straightforward and helpful for checking the work.

Referee Report

2 major / 2 minor

Summary. The paper introduces SWoMo, a neuro-symbolic world model for cataract surgery simulation. It decouples motion generation and tool-tissue interactions (via a rule-based simulator and scene-graph representations) from visual realism (via a video diffusion model). An inverse pairing strategy reconstructs real surgical videos inside the symbolic simulator to produce paired sim-real data for training the diffusion model on sim-to-real translation. Experiments claim qualitative and quantitative improvements over prior work, plus generalization to unseen interaction geometries, gains in downstream phase detection, and unsupervised video style transfer. Code, data, and weights are released.

Significance. If the central claims hold, the neuro-symbolic decoupling offers a principled route to physically grounded yet visually realistic surgical simulation, addressing limitations of pure neural world models. The open release of code and data strengthens reproducibility. The approach could support surgeon training and autonomous agent development, provided the inverse reconstruction step yields sufficiently accurate paired data.

major comments (2)

§3.2 (Inverse Pairing Strategy): The reconstruction procedure is presented as yielding high-fidelity paired training data, yet no quantitative metrics (e.g., tool-position error, tissue-deformation L2, occlusion IoU, or temporal alignment scores) are reported for how closely the symbolic scene graphs recover real video geometry and dynamics. This is load-bearing for the sim-to-real diffusion training and the claimed generalization to unseen geometries.
§4 (Experiments): The abstract and results claim both qualitative and quantitative improvements plus downstream benefits, but the provided text supplies no concrete numbers, baselines, or ablation tables (e.g., FID, PSNR, phase-detection F1, or cross-geometry success rates). Without these, the strength of the empirical support cannot be evaluated.
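Three of the metrics named in these comments have standard definitions that are easy to state in code. The sketch below is illustrative only: the function names, the nested-list image format, and the 2-D point format are assumptions, and none of it reflects how the paper computes its results.

```python
import math


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    cells = [(a, b) for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb)]
    inter = sum(1 for a, b in cells if a and b)
    union = sum(1 for a, b in cells if a or b)
    return inter / union if union else 1.0  # two empty masks count as a match


def mean_position_error(predicted, reference):
    """Mean Euclidean distance between per-frame tool positions."""
    return sum(math.dist(p, r) for p, r in zip(predicted, reference)) / len(reference)


def psnr(image_a, image_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    sq = [(a - b) ** 2 for ra, rb in zip(image_a, image_b) for a, b in zip(ra, rb)]
    mse = sum(sq) / len(sq)
    return math.inf if mse == 0 else 10 * math.log10(max_val ** 2 / mse)


print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
print(mean_position_error([(0, 0), (3, 4)], [(0, 0), (0, 0)]))
print(psnr([[0, 0]], [[0, 0]]))
```

Mask IoU and position error would apply to the reconstructed scene graphs against real annotations, while PSNR compares generated frames against real ones, so they address the two major comments separately.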

minor comments (2)

Notation for scene-graph nodes and diffusion conditioning variables is introduced without a consolidated table; a single reference table would improve readability.
The claim of 'parameter-free' motion dynamics should be cross-checked against any learned components inside the rule-based simulator.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address each major comment below and describe the revisions we will make to improve the manuscript.

Referee: §3.2 (Inverse Pairing Strategy): The reconstruction procedure is presented as yielding high-fidelity paired training data, yet no quantitative metrics (e.g., tool-position error, tissue-deformation L2, occlusion IoU, or temporal alignment scores) are reported for how closely the symbolic scene graphs recover real video geometry and dynamics. This is load-bearing for the sim-to-real diffusion training and the claimed generalization to unseen geometries.

Authors: We agree that direct quantitative metrics on reconstruction fidelity would strengthen the justification for the paired data and the generalization results. The current manuscript supports the inverse pairing through qualitative inspection, downstream task performance, and cross-geometry experiments. In the revised version we will add a dedicated evaluation subsection reporting tool-position error, tissue-deformation L2, occlusion IoU, and temporal alignment scores on a held-out set of reconstructed videos.
revision: yes

Referee: §4 (Experiments): The abstract and results claim both qualitative and quantitative improvements plus downstream benefits, but the provided text supplies no concrete numbers, baselines, or ablation tables (e.g., FID, PSNR, phase-detection F1, or cross-geometry success rates). Without these, the strength of the empirical support cannot be evaluated.

Authors: We acknowledge that the numerical results and comparison tables should be presented more explicitly. Although the experiments section describes the improvements and includes supporting figures, we will revise the manuscript to include clear tables with all concrete metric values (FID, PSNR, phase-detection F1, cross-geometry success rates), baselines, and ablations so that the quantitative claims can be directly evaluated.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; components and training remain independent of self-referential inputs

The paper's derivation separates the symbolic rule-based simulator and scene-graph modeling of motion dynamics and tool-tissue interactions from the diffusion model for visual appearance. The inverse pairing strategy reconstructs real surgical videos inside the simulator to generate paired data drawn from external real videos for training the diffusion model. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains that bear the load of the results. The architecture and evaluation against generalization, phase detection, and style transfer are self-contained with independent external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that the rule-based simulator faithfully captures tool-tissue dynamics; no explicit free parameters or new invented entities are described in the abstract. The approach assembles standard components (scene graphs, diffusion models) from prior work.

axioms (1)

domain assumption: The rule-based simulator and scene graph representations accurately model motion dynamics and tool-tissue interactions for cataract surgery procedures.
This premise underpins the symbolic component's ability to provide physically grounded interactions and generalization, as stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1] Al Hajj, H., Lamard, M., Conze, P.H., Roychowdhury, S., Hu, X., Maršalkaitė, G., Zisimopoulos, O., Dedmari, M.A., Zhao, F., Prellberg, J., et al.: Cataracts: Challenge on automatic tool annotation for cataract surgery. MedIA 52, 24–41 (2019)
[2] Biagini, D., Navab, N., Farshad, A.: Hierasurg: Hierarchy-aware diffusion model for surgical video generation. In: MICCAI. pp. 310–319. Springer (2025)
[3] Boels, M., Robertshaw, H., Booth, T.C., Granados, A., Dasgupta, P., Ourselin, S.: Surgical robot learning: From demonstration and simulation to world models-a review. Authorea Preprints (2025)
[4] Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: ICML (2024)
[5] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV. pp. 9650–9660 (2021)
[6] Chen, Z., Xu, Q., Wu, J., Yang, B., Zhai, Y., Guo, G., Zhang, J., Ding, Y., Navab, N., Luo, J.: How far are surgeons from surgical world models? a pilot study on zero-shot surgical video generation with expert assessment. arXiv:2511.01775 (2025)
[7] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR. pp. 1280–1289 (2021)
[8] Frisch, Y., Sivakumar, S.K., Köksal, Ç., Böhm, E., Wagner, F., Gericke, A., Ghazaei, G., Mukhopadhyay, A.: Surgrid: controllable surgical simulation via scene graph to image diffusion. Int J CARS 20(7), 1421–1429 (2025)
[9] Ghamsarian, N., El-Shabrawi, Y., Nasirihaghighi, S., Putzgruber-Adamitsch, D., Zinkernagel, M., Wolf, S., Schoeffmann, K., Sznitman, R.: Cataract-1k: cataract surgery dataset for scene segmentation, phase recognition, and irregularity detection. arXiv:2312.06295 (2023)
[10] Godot Engine Contributors: Godot engine (2024), https://godotengine.org, free and open-source 2D and 3D game engine
[11] Ha, D., Schmidhuber, J.: World models. arXiv:1803.10122 (2018)
[12] He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity long video generation. arXiv:2211.13221 (2022)
[13] He, Y., Guo, P., Xu, M., Li, Z., Myronenko, A., Imans, D., Liu, B., Yang, D., Gu, M., Ji, Y., et al.: Surgworld: Learning surgical robot policies from videos via world modeling. arXiv:2512.23162 (2025)
[14] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
[15] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. NeurIPS 35, 8633–8646 (2022)
[16] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
[17] Jang, J., Ye, S., Lin, Z., Xiang, J., Bjorck, J., Fang, Y., Hu, F., Huang, S., Kundalia, K., Lin, Y.C., et al.: Dreamgen: Unlocking generalization in robot learning through video world models. arXiv:2505.12705 (2025)
[18] Kadian, A., Truong, J., Gokaslan, A., Clegg, A., Wijmans, E., Lee, S., Savva, M., Chernova, S., Batra, D.: Sim2real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robot Autom. Let. 5(4), 6670–6677 (2020)
[19] Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. arXiv:2307.07635 (2023)
[20] Koju, S., Bastola, S., Shrestha, P., Amgain, S., Shrestha, Y.R., Poudel, R.P., Bhattarai, B.: Surgical vision world model. In: MICCAI Workshop on Data Engineering in Medical Imaging. pp. 1–10. Springer (2025)
[21] Li, C., Liu, H., Liu, Y., Feng, B.Y., Li, W., Liu, X., Chen, Z., Shao, J., Yuan, Y.: Endora: Video generation models as endoscopy simulators. In: MICCAI. pp. 230–240. Springer (2024)
[22] Li, S., Farha, Y.A., Liu, Y., Cheng, M.M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE TPAMI (2020)
[23] Lin, H., Li, B., Au, K.W.S.: Visuomotor grasping with world models for surgical robots. arXiv:2508.11200 (2025)
[24] Martyniak, S., Kaleta, J., Dall’Alba, D., Naskręt, M., Płotka, S., Korzeniowski, P.: Simuscope: Realistic endoscopic synthetic dataset generation through surgical simulation and diffusion models. In: WACV. pp. 4268–4278. IEEE (2025)
[25] Nair, A.G., Ahiwalay, C., Bacchav, A.E., Sheth, T., Lansingh, V.C., Vedula, S.S., Bhatt, V., Reddy, J.C., Vadavalli, P.K., Praveen, S., et al.: Effectiveness of simulation-based training for manual small incision cataract surgery among novice surgeons: a randomized controlled trial. Scientific Reports 11(1), 10945 (2021)
[26] Niu, M., Cun, X., Wang, X., Zhang, Y., Shan, Y., Zheng, Y.: Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. In: ECCV. pp. 111–128. Springer (2024)
[27] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv:2408.00714 (2024)
[28] Sivakumar, S.K., Frisch, Y., Ghazaei, G., Mukhopadhyay, A.: Sg2vid: Scene graphs enable fine-grained control for video synthesis. In: MICCAI. pp. 511–521. Springer (2025)
[29] Sivakumar, S.K., Frisch, Y., Ranem, A., Mukhopadhyay, A.: Sasvi: segment any surgical video. Int J CARS 20(7), 1409–1419 (2025)
[30] Skorokhodov, I., Tulyakov, S., Elhoseiny, M.: Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: CVPR. pp. 3626–3636 (2022)
[31] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv:1812.01717 (2018)
[32] Venkatesh, D.K., Rivoir, D., Pfeiffer, M., Speidel, S.: Surgical-cd: Generating surgical images via unpaired image translation with latent consistency diffusion models. In: European Conference on Computer Vision. pp. 218–235. Springer (2024)
[33] Wang, Z., Zhang, L., Wang, L., Zhu, M., Zhang, Z.: Optical flow representation alignment mamba diffusion model for medical video generation. arXiv:2411.01647 (2024)
[34] Yang, Y., Zhang, Z., Zhang, X., Zeng, Y., Li, H., Zuo, W.: Physworld: From real videos to world models of deformable objects via physics-aware demonstration synthesis. arXiv:2510.21447 (2025)
[35] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV. pp. 3836–3847 (2023)
[36] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. pp. 586–595 (2018)
