SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

Hengyu Liu; Shengyuan Liu; Wentao Pan; Wuyang Li; Xinyu Liu; Yixuan Yuan

arxiv: 2606.19889 · v1 · pith:77SVHMPRnew · submitted 2026-06-18 · 💻 cs.CV

Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

Pith reviewed 2026-06-26 18:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords surgical world model · long-horizon prediction · instrument-tissue dynamics · deformation consistency · drift adaptation · video prediction · robot surgery · SurgWorld-Bench

The pith

SurgVista generates long-horizon surgical video predictions with consistent instrument-tissue interactions by enforcing deformation coherence and adapting to drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to build a surgical world model that predicts future frames from an initial observation and a sequence of instrument actions. It targets two failures: instrument contact that does not produce spatially consistent tissue deformation, and errors that accumulate over many autoregressive steps. It does so with two training methods. Deformation Consistency Regularization extracts scene-point trajectories from training videos and applies latent contrastive learning to keep deformations coherent across frames. Drift Adaptation Training perturbs conditioning frames with prediction residuals and photometric augmentations calibrated to long-horizon drift statistics. A new benchmark, SurgWorld-Bench, supports evaluation across procedure types with separate scores for instrument motion and tissue response. If these changes hold, the model sustains visual quality and interaction fidelity as rollout length increases, offering a route to train robot policies on simulated futures rather than through risky in vivo exploration.

Core claim

SurgVista mitigates spatial interaction incoherence and temporal fidelity collapse in surgical world models with two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. The evidence offered is consistent outperformance on SurgWorld-Bench, with gains that widen at longer prediction horizons.

What carries the argument

Deformation Consistency Regularization extracts scene-point trajectories and enforces cross-frame coherence via latent contrastive learning; Drift Adaptation Training perturbs conditioning frames with residuals and augmentations to counter drift accumulation.
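
To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style loss over per-point latent features. It assumes, and this is not stated in the source, that a feature vector is sampled at each tracked scene point in two frames, that the same track in both frames forms the positive pair, and that the other tracks act as negatives. The function name `info_nce`, the temperature value, and the toy features are hypothetical; the paper's actual loss may differ.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should match positives[i]
    and be dissimilar to positives[j] for j != i."""

    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical latent features at three tracked points in frame t and frame t+k.
frame_t = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
coherent = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]      # each track keeps its identity
shuffled = [coherent[1], coherent[2], coherent[0]]    # track identities scrambled

loss_coherent = info_nce(frame_t, coherent)
loss_shuffled = info_nce(frame_t, shuffled)
```

With these toy features the coherent pairing yields a much lower loss than the shuffled pairing, which is the behaviour a cross-frame coherence term is meant to reward. Note that a low loss here only says the latent features track the same points consistently; it says nothing about mechanical validity.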

If this is right

Prediction quality and interaction fidelity hold up better than prior methods as the number of future frames increases.
Instrument contact produces spatially consistent tissue deformation instead of incoherent motion.
The introduced benchmark separates evaluation of motion accuracy from tissue-response fidelity across diverse procedures.
World-model training can proceed without direct in vivo exploration by generating extended action-conditioned sequences.
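
The decoupled evaluation mentioned in the list above can be pictured with a small sketch. It assumes that reference and predicted point tracks are available as pixel coordinates and that each track carries an instrument-or-tissue label. The function names, the mean Euclidean error, and the toy tracks are illustrative assumptions, not SurgWorld-Bench's published metric definitions.

```python
import math


def mean_track_error(pred, target):
    """Mean Euclidean distance between predicted and reference tracks.
    Each argument is a list of tracks; a track is a list of (x, y) per frame."""
    total, count = 0.0, 0
    for pred_track, target_track in zip(pred, target):
        for (px, py), (tx, ty) in zip(pred_track, target_track):
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / count if count else float("nan")


def decoupled_errors(pred, target, is_instrument):
    """Score instrument tracks and tissue tracks separately (hypothetical helper)."""
    groups = {"instrument": ([], []), "tissue": ([], [])}
    for pred_track, target_track, flag in zip(pred, target, is_instrument):
        key = "instrument" if flag else "tissue"
        groups[key][0].append(pred_track)
        groups[key][1].append(target_track)
    return {key: mean_track_error(p, t) for key, (p, t) in groups.items()}


# Toy case: the instrument moves correctly but the predicted tissue stays frozen.
reference = [[(0, 0), (1, 0), (2, 0)], [(5, 5), (5, 6), (5, 7)]]
prediction = [[(0, 0), (1, 0), (2, 0)], [(5, 5), (5, 5), (5, 5)]]
scores = decoupled_errors(prediction, reference, is_instrument=[True, False])
```

In this toy case the instrument error is zero while the tissue error is not, so a single pooled score would hide exactly the failure the benchmark is designed to expose.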

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The trajectory-based contrastive approach could transfer to video prediction tasks involving other deformable surfaces outside surgery.
If the methods scale, they would lower the volume of expert demonstrations needed for policy learning in contact-rich robotics.
A direct test would compare the learned dynamics against measured force or deformation data from instrumented phantoms.

Load-bearing premise

Enforcing cross-frame coherence on extracted scene-point trajectories through latent contrastive learning produces physically consistent instrument-tissue dynamics rather than merely visually plausible motion.

What would settle it

Long-horizon rollouts that produce tissue deformations violating basic physical constraints such as volume conservation or elasticity rules, while still scoring high on visual and contrastive metrics.
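
One rough way to probe for such violations from video alone is sketched below. It assumes tracked tissue points can be grouped into triangles and uses the change in projected 2D triangle area as a loose stand-in for volume conservation. That proxy is an assumption of this sketch: projected area also changes with camera motion and out-of-plane deformation, so a flag is a prompt for inspection, not proof of a physical violation. The function names and the thresholds `low` and `high` are hypothetical.

```python
def triangle_area(a, b, c):
    """Unsigned area of a 2D triangle (shoelace formula)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def area_ratios(points_start, points_end, triangles):
    """Ratio of final to initial area for each triangle of tracked points."""
    ratios = []
    for i, j, k in triangles:
        before = triangle_area(points_start[i], points_start[j], points_start[k])
        after = triangle_area(points_end[i], points_end[j], points_end[k])
        ratios.append(after / before if before > 0 else float("inf"))
    return ratios


def flag_implausible(ratios, low=0.5, high=2.0):
    """Indices of triangles whose area change falls outside [low, high].
    The bounds are placeholders, not measured tissue limits."""
    return [index for index, ratio in enumerate(ratios) if not (low <= ratio <= high)]


# Toy tissue patch: four tracked points forming two triangles.
start = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
end = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.2, 1.2)]   # one corner collapses inward
triangles = [(0, 1, 2), (1, 3, 2)]

ratios = area_ratios(start, end, triangles)
flagged = flag_implausible(ratios)
```

A check against instrumented phantoms or a physics-based simulator would be far stronger evidence; this kind of geometric screen only narrows down which rollouts deserve a closer look.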

Figures

Figures reproduced from arXiv: 2606.19889 by Hengyu Liu, Shengyuan Liu, Wentao Pan, Wuyang Li, Xinyu Liu, Yixuan Yuan.

**Figure 1.** Surgical world modeling. (a) Simulating scene evolution for safe policy learning. (b) Predicting future frames from an initial frame and instrument actions. Two key challenges: (c) Tissue fails to deform under instrument contact. (d) Visual quality degrades over long-horizon rollouts.
Adjacent text: … control signal has progressed from coarse textual descriptions [19, 32] to fine-grained spatial specifications [3, 9, 27],…

**Figure 2.** Overview of SurgVista. (a) Training paradigm, consisting of two training recipes: Deformation Consistency Regularization (DCR) and Drift Adaptation Training (DAT). (b) DAT simulates inference-time conditioning drift via two complementary proxies. (c) DCR enforces cross-frame motion coherence via contrastive learning on tracked scene-point trajectories.
Adjacent text: 3.1 Latent Action Encoding. Prior surgical world models…

**Figure 3.** Main photometric statistics over long-horizon predictions.
Adjacent text: Model-agnostic proxy: photometric augmentations. As training progresses, the probe residual δ shrinks with improving model accuracy. Yet repeated decode-encode cycles during autoregressive rollout still introduce persistent photometric degradations, including contrast drift, color skew, texture blur, and specular blowout. To quantify this effe…

**Figure 4.** Overview of SurgWorld-Bench. (a) Task setup with three procedure types and two evaluation horizons. (b) Evaluation suite with decoupled instrument and tissue trajectory metrics. (c) Data curation pipeline for extracting region-specific motion trajectories.
Adjacent text: Task and evaluation horizons. Given an initial frame s_0 and an instrument trajectory a_{1:T}, the model predicts the future state sequence ŝ_{1:T}. The ben…

**Figure 5.** Qualitative comparison with state-of-the-art methods. Left: short-horizon predictions at frames 1, 5, 10, and 15. Right: long-horizon predictions at frames 1, 100, 500, and 800.
Adjacent text: 5.3 Ablation Studies

Original abstract

Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.
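
The Drift Adaptation Training idea described in the abstract can be sketched as follows. The sketch assumes a frame is a flat list of grayscale intensities in [0, 1], that the prediction residual is simply the model's own prediction minus the clean frame, and that the photometric change is a contrast and brightness jitter. The function name, the parameter ranges, and the toy frames are hypothetical; in the paper the augmentation strengths are calibrated to measured long-horizon drift statistics, which are not reproduced here.

```python
import random


def perturb_conditioning_frame(clean, predicted, rng, residual_scale=1.0,
                               contrast_range=(0.8, 1.0),
                               brightness_range=(-0.05, 0.05)):
    """Return a degraded conditioning frame for training (illustrative only)."""
    # Proxy 1: inject the model's own prediction residual into the clean frame.
    drifted = [c + residual_scale * (p - c) for c, p in zip(clean, predicted)]
    # Proxy 2: photometric jitter, here contrast and brightness around the mean.
    mean = sum(drifted) / len(drifted)
    contrast = rng.uniform(*contrast_range)
    brightness = rng.uniform(*brightness_range)
    jittered = [(x - mean) * contrast + mean + brightness for x in drifted]
    # Keep intensities in the valid range.
    return [min(1.0, max(0.0, x)) for x in jittered]


rng = random.Random(0)                 # fixed seed so the example is repeatable
clean = [0.2, 0.5, 0.8]                # toy ground-truth frame
predicted = [0.25, 0.45, 0.9]          # toy one-step model prediction
conditioning = perturb_conditioning_frame(clean, predicted, rng)
```

Training on frames degraded this way exposes the model to inputs that resemble what it will see late in a rollout, when it conditions on its own imperfect outputs instead of clean ground truth.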

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SurgVista adds two targeted training recipes and a benchmark for long-horizon surgical video prediction, but the physical consistency of the resulting dynamics rests on an untested assumption about latent contrastive learning.

The paper's core contribution is a pair of training adjustments meant to fix two specific problems in surgical world models: instruments that don't deform tissue in a spatially coherent way, and predictions that degrade over long autoregressive rollouts. It also releases SurgWorld-Bench with decoupled metrics for instrument accuracy and tissue response.
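
The second problem, degradation over long autoregressive rollouts, comes from feeding each prediction back in as the next input. The toy sketch below shows that mechanism with a scalar state. The dynamics, the constant per-step `bias`, and all names are hypothetical stand-ins chosen for clarity; a real video model's errors are neither scalar nor constant.

```python
def rollout(step_fn, initial_state, actions):
    """Autoregressive rollout: each predicted state becomes the next input."""
    states, state = [], initial_state
    for action in actions:
        state = step_fn(state, action)
        states.append(state)
    return states


def true_step(state, action):
    # Toy ground-truth dynamics: the state moves by exactly the action.
    return state + action


def model_step(state, action, bias=0.01):
    # Toy learned model with a small systematic per-step error (hypothetical).
    return state + action + bias


actions = [0.1] * 100
reference = rollout(true_step, 0.0, actions)
predicted = rollout(model_step, 0.0, actions)
errors = [abs(p - r) for p, r in zip(predicted, reference)]
```

A one-step error of 0.01 grows to roughly 1.0 after 100 steps in this toy setting, which is why evaluating only short horizons can make a drifting model look accurate.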

The Deformation Consistency Regularization extracts scene-point trajectories and applies latent contrastive learning to enforce cross-frame coherence. Drift Adaptation Training perturbs conditioning frames with residuals and photometric changes drawn from observed drift statistics. These are presented as new recipes rather than direct copies of prior work, and the benchmark setup looks practical for the domain.

The work does a reasonable job of identifying the failure modes and designing fixes that match them. The benchmark's separation of metrics is a clear step forward for evaluation in this area.

The soft spot is the leap from latent trajectory coherence to "physically consistent instrument-tissue dynamics." The method has no explicit physics, constitutive model, or force supervision, and the abstract gives no sign that the outputs were checked against biomechanical ground truth rather than visual and interaction-fidelity scores. Visual plausibility can be achieved without mechanical validity, and nothing in the description shows that distinction was tested. The claim of consistent outperformance is stated but the abstract supplies no numbers, so the size and reliability of the gains remain unclear from the summary alone.

This is for people working on simulation for autonomous surgery or similar constrained video prediction tasks. A reader who needs concrete recipes and a new testbed for long-horizon surgical rollouts would find usable material here.

It deserves peer review because the ideas are specific, the benchmark is new, and the problem is practically relevant, even if the physical-consistency argument needs more evidence.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces SurgVista, a surgical world model for generating action-conditioned future frames in robotic surgery. It proposes two training recipes—Deformation Consistency Regularization, which extracts scene-point trajectories from videos and applies latent contrastive learning to enforce cross-frame coherence, and Drift Adaptation Training, which perturbs conditioning frames with prediction residuals and photometric augmentations—to address spatial interaction incoherence and temporal fidelity collapse. The work also introduces SurgWorld-Bench for evaluating long-range rollouts with decoupled metrics on instrument motion and tissue response. Experiments claim that SurgVista outperforms state-of-the-art methods on visual quality, temporal consistency, and interaction fidelity, with gains increasing over longer prediction horizons.

Significance. If validated, the contributions could meaningfully advance world-model-based policy learning for autonomous surgery by mitigating key failure modes in long-horizon prediction. The introduction of SurgWorld-Bench provides a useful standardized evaluation resource for the community. The trajectory-based contrastive regularization is a reasonable attempt to inject consistency priors without explicit physics simulation.

major comments (3)

[Section 3.1] Section 3.1 (Deformation Consistency Regularization): the claim that latent contrastive learning on extracted scene-point trajectories yields 'physically consistent instrument-tissue dynamics' lacks supporting evidence; the method enforces cross-frame coherence in latent space but provides no ablation, comparison to physics-based simulators, or biomechanical measurements to distinguish physical validity from temporally smooth but non-physical motion.
[SurgWorld-Bench and evaluation] SurgWorld-Bench description and evaluation metrics: the decoupled 'tissue-response fidelity' metric is presented as capturing interaction quality, yet the paper does not demonstrate that it measures adherence to real biomechanical constraints rather than visual or kinematic plausibility; this distinction is load-bearing for the central claim that the regularization produces physically consistent dynamics.
[Experiments] Experiments section: while outperformance is asserted across visual quality, temporal consistency, and interaction fidelity with widening gains at longer horizons, the manuscript supplies no quantitative tables, specific metric values, baseline implementations, or statistical tests in the provided description, preventing assessment of whether the data actually support the stated claims.

minor comments (1)

[Abstract] Abstract: including at least one key quantitative result (e.g., a primary metric improvement at a given horizon) would strengthen the summary of contributions.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, providing clarifications and indicating planned revisions where appropriate.

Referee: [Section 3.1] Section 3.1 (Deformation Consistency Regularization): the claim that latent contrastive learning on extracted scene-point trajectories yields 'physically consistent instrument-tissue dynamics' lacks supporting evidence; the method enforces cross-frame coherence in latent space but provides no ablation, comparison to physics-based simulators, or biomechanical measurements to distinguish physical validity from temporally smooth but non-physical motion.

Authors: We agree that the regularization enforces cross-frame coherence in latent space via contrastive learning on trajectories and does not explicitly model or validate physical laws. The phrasing 'physically consistent' in the manuscript is interpretive rather than directly evidenced by biomechanical data. We will revise Section 3.1 and related claims to describe the outcome as 'plausible instrument-tissue dynamics' consistent with observed trajectories. We will add an ablation isolating the regularization's effect on coherence metrics. Direct comparisons to physics-based simulators or biomechanical measurements are not feasible within this data-driven framework without new resources. revision: partial
Referee: [SurgWorld-Bench and evaluation] SurgWorld-Bench description and evaluation metrics: the decoupled 'tissue-response fidelity' metric is presented as capturing interaction quality, yet the paper does not demonstrate that it measures adherence to real biomechanical constraints rather than visual or kinematic plausibility; this distinction is load-bearing for the central claim that the regularization produces physically consistent dynamics.

Authors: We concur that the tissue-response fidelity metric evaluates visual and kinematic alignment with observed deformations rather than verifying biomechanical constraints. This is an inherent limitation of video-based evaluation. We will revise the SurgWorld-Bench description and central claims to specify that the metric assesses visual plausibility and consistency with real data trajectories, removing implications of biomechanical validation. revision: yes
Referee: [Experiments] Experiments section: while outperformance is asserted across visual quality, temporal consistency, and interaction fidelity with widening gains at longer horizons, the manuscript supplies no quantitative tables, specific metric values, baseline implementations, or statistical tests in the provided description, preventing assessment of whether the data actually support the stated claims.

Authors: The complete manuscript contains quantitative tables with specific metric values, baseline implementation details, and statistical test results in the Experiments section. We will ensure these elements are clearly highlighted and cross-referenced in the revised submission to facilitate assessment. revision: no

Standing simulated objections (not resolved)

Direct biomechanical measurements or comparisons against physics-based simulators to substantiate physical validity claims, as these require additional data collection and expertise outside the current video-based study.

Circularity Check

0 steps flagged

No significant circularity; training recipes and metrics are independently specified.

full rationale

The paper defines two explicit training procedures—Deformation Consistency Regularization (extracting trajectories then applying latent contrastive learning) and Drift Adaptation Training (perturbing frames with residuals and augmentations)—as new recipes that address stated failure modes. These are not shown to reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. Evaluation relies on a new benchmark with decoupled metrics for instrument motion and tissue response, presented as empirical comparisons rather than derivations that collapse to inputs by construction. No uniqueness theorems, ansatzes via prior self-work, or renaming of known results appear in the provided description. The central claims remain falsifiable via the stated visual, temporal, and interaction metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; the central claim rests on the effectiveness of the two described training recipes whose internal details are not provided.

pith-pipeline@v0.9.1-grok · 5764 in / 1045 out tokens · 42870 ms · 2026-06-26T18:07:20.455139+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 18 canonical work pages · 13 internal anchors

[1]

Pixel-wise recognition for holistic surgical scene understanding // Medical Image Analysis

Ayobi Nicolás, Rodríguez Santiago, Pérez Alejandra, Hernández Isabela, Aparicio Nicolás, Dessevres Eugénie, Peña Sebastián, Santander Jessica, Caicedo Juan Ignacio, Fernández Nicolás, others. Pixel-wise recognition for holistic surgical scene understanding // Medical Image Analysis. 2025. 103726

2025
[2]

Hierasurg: Hierarchy-aware diffusion model for surgical video generation // International Conference on Medical Image Computing and Computer-Assisted Intervention

Biagini Diego, Navab Nassir, Farshad Azade. Hierasurg: Hierarchy-aware diffusion model for surgical video generation // International Conference on Medical Image Computing and Computer-Assisted Intervention. 2025. 310–319

2025
[3]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann Andreas, Dockhorn Tim, Kulal Sumith, Mendelevitch Daniel, Kilian Maciej, Lorenz Dominik, Levi Yam, English Zion, Voleti Vikram, Letts Adam, others. Stable video diffusion: Scaling latent video diffusion models to large datasets // arXiv preprint arXiv:2311.15127. 2023

2023
[4]

Genie: generative interactive environments // Proceedings of the 41st International Conference on Machine Learning

Bruce Jake, Dennis Michael, Edwards Ashley, Parker-Holder Jack, Shi Yuge, Hughes Edward, Lai Matthew, Mavalankar Aditi, Steigerwald Richie, Apps Chris, others. Genie: generative interactive environments // Proceedings of the 41st International Conference on Machine Learning. 2024. 4603–4623

2024
[5]

MONAI: An open-source framework for deep learning in healthcare

Cardoso M Jorge, Li Wenqi, Brown Richard, Ma Nic, Kerfoot Eric, Wang Yiheng, Murrey Benjamin, Myronenko Andriy, Zhao Can, Yang Dong, others. Monai: An open-source framework for deep learning in healthcare // arXiv preprint arXiv:2211.02701. 2022

2022
[6]

Diffusion forcing: Next-token prediction meets full-sequence diffusion // Advances in Neural Information Processing Systems

Chen Boyuan, Martí Monsó Diego, Du Yilun, Simchowitz Max, Tedrake Russ, Sitzmann Vincent. Diffusion forcing: Next-token prediction meets full-sequence diffusion // Advances in Neural Information Processing Systems. 2024. 37. 24081–24125

2024
[7]

A simple framework for contrastive learning of visual representations // International conference on machine learning

Chen Ting, Kornblith Simon, Norouzi Mohammad, Hinton Geoffrey. A simple framework for contrastive learning of visual representations // International conference on machine learning. 2020. 1597–1607

2020
[8]

Surgsora: Object-aware diffusion model for controllable surgical video generation // International Conference on Medical Image Computing and Computer-Assisted Intervention

Chen Tong, Yang Shuya, Wang Junyi, Bai Long, Ren Hongliang, Zhou Luping. Surgsora: Object-aware diffusion model for controllable surgical video generation // International Conference on Medical Image Computing and Computer-Assisted Intervention. 2025. 521–531

2025
[9]

Wan-move: Motion-controllable video generation via latent trajectory guidance // arXiv preprint arXiv:2512.08765

Chu Ruihang, He Yefei, Chen Zhekai, Zhang Shiwei, Xu Xiaogang, Xia Bin, Wang Dingdong, Yi Hongwei, Liu Xihui, Zhao Hengshuang, others. Wan-move: Motion-controllable video generation via latent trajectory guidance // arXiv preprint arXiv:2512.08765. 2025

2025
[10]

Robotic surgery // Nature Reviews Bioengineering

Ciuti Gastone, Webster III Robert J, Kwok Ka-Wai, Menciassi Arianna. Robotic surgery // Nature Reviews Bioengineering. 2025. 3, 7. 565–578

2025
[11]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Cui Justin, Wu Jie, Li Ming, Yang Tao, Li Xiaojie, Wang Rui, Bai Andrew, Ban Yuanhao, Hsieh Cho-Jui. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation // arXiv preprint arXiv:2510.02283. 2025

2025
[12]

World Models

Ha David, Schmidhuber Jürgen. World models // arXiv preprint arXiv:1803.10122. 2018. 2, 3. 440

Doersch Carl, Gupta Ankush, Markeeva Larisa, Recasens Adria, Smaira Lucas, Aytar Yusuf, Carreira Joao, Zisserman Andrew, Yang Yi. Tap-vid: A benchmark for tracking any point in a video // Advances in Neural Information Processing Systems. 2022. 35. 13610–13626

2018
[13]

Dream to Control: Learning Behaviors by Latent Imagination

Hafner Danijar, Lillicrap Timothy, Ba Jimmy, Norouzi Mohammad. Dream to control: Learning behaviors by latent imagination // arXiv preprint arXiv:1912.01603. 2019

2019
[14]

Mastering Diverse Domains through World Models

Hafner Danijar, Pasukonis Jurgis, Ba Jimmy, Lillicrap Timothy. Mastering diverse domains through world models // arXiv preprint arXiv:2301.04104. 2023

2023
[15]

Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, Girshick Ross. Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. 9729–9738

2020
[16]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

He Xianglong, Peng Chunli, Liu Zexiang, Wang Boyang, Zhang Yifan, Cui Qi, Kang Fei, Jiang Biao, An Mengyin, Ren Yangyang, others. Matrix-game 2.0: An open-source real-time and streaming interactive world model // arXiv preprint arXiv:2508.13009. 2025

2025
[17]

SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling // arXiv preprint arXiv:2512.23162

He Yufan, Guo Pengfei, Xu Mengya, Li Zhaoshuo, Myronenko Andriy, Imans Dillan, Liu Bingjie, Yang Dongren, Gu Mingxue, Ji Yongnan, others. SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling // arXiv preprint arXiv:2512.23162. 2025. 10

2025
[18]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text // Proceedings of the Computer Vision and Pattern Recognition Conference

Henschel Roberto, Khachatryan Levon, Poghosyan Hayk, Hayrapetyan Daniil, Tadevosyan Vahram, Wang Zhangyang, Navasardyan Shant, Shi Humphrey. Streamingt2v: Consistent, dynamic, and extendable long video generation from text // Proceedings of the Computer Vision and Pattern Recognition Conference
[19]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang Xun, Li Zhengqi, He Guande, Zhou Mingyuan, Shechtman Eli. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion // arXiv preprint arXiv:2506.08009. 2025

2025
[20]

Vbench: Comprehensive benchmark suite for video generative models // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang Ziqi, He Yinan, Yu Jiashuo, Zhang Fan, Si Chenyang, Jiang Yuming, Zhang Yuanhan, Wu Tianxing, Jin Qingyang, Chanpaisit Nattapol, others. Vbench: Comprehensive benchmark suite for video generative models // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 21807–21818

2024
[21]

Cotracker3: Simpler and better point tracking by pseudo-labelling real videos // Proceedings of the IEEE/CVF International Conference on Computer Vision

Karaev Nikita, Makarov Yuri, Wang Jianyuan, Neverova Natalia, Vedaldi Andrea, Rupprecht Christian. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. 6013–6022

2025
[22]

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling // International Conference on Learning Representations 2025 (ICLR 2025)

Li Wuyang, Pan Wentao, Luan Po-Chien, Gao Yang, Alahi Alexandre. Stable Video Infinity: Infinite-Length Video Generation with Error Recycling // International Conference on Learning Representations 2025 (ICLR 2025). 2026

2025
[23]

Elucidating the Exposure Bias in Diffusion Models // 12th International Conference on Learning Representations, ICLR 2024

Ning Mang, Li Mingxiao, Su Jianlin, Salah Albert Ali, Ertugrul Itir Onal. Elucidating the Exposure Bias in Diffusion Models // 12th International Conference on Learning Representations, ICLR 2024. 2024

2024
[24]

Cholectrack20: A multi-perspective tracking dataset for surgical tools // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Nwoye Chinedu Innocent, Elgohary Kareem, Srinivas Anvita, Zaid Fauzan, Lavanchy Joël L, Padoy Nicolas. Cholectrack20: A multi-perspective tracking dataset for surgical tools // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025. 8942–8952

2025
[25]

SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation // arXiv preprint arXiv:2603.13024

Rapuri Sampath, Seenivasan Lalithkumar, Schneider Dominik, Soberanis-Mukul Roger, He Yufan, Ding Hao, Xu Jiru, Yu Chenhao, Jing Chenyan, Guo Pengfei, others. SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation // arXiv preprint arXiv:2603.13024. 2026

2026
[26]

General-purpose foundation models for increased autonomy in robot-assisted surgery // Nature Machine Intelligence

Schmidgall Samuel, Kim Ji Woong, Kuntz Alan, Ghazi Ahmed Ezzat, Krieger Axel. General-purpose foundation models for increased autonomy in robot-assisted surgery // Nature Machine Intelligence. 2024. 6, 11. 1275–1283

2024
[27]

Generalization in generation: A closer look at exposure bias // Proceedings of the 3rd Workshop on Neural Generation and Translation

Schmidt Florian. Generalization in generation: A closer look at exposure bias // Proceedings of the 3rd Workshop on Neural Generation and Translation. 2019. 157–167

2019
[28]

History-Guided Video Diffusion // International Conference on Machine Learning

Song Kiwhan, Chen Boyuan, Simchowitz Max, Du Yilun, Tedrake Russ, Sitzmann Vincent. History-Guided Video Diffusion // International Conference on Machine Learning. 2025. 56242–56280

2025
[29]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Sun Wenqiang, Zhang Haiyu, Wang Haoyuan, Wu Junta, Wang Zehan, Wang Zhenwei, Wang Yunhong, Zhang Jun, Wang Tengfei, Guo Chunchao. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling // arXiv preprint arXiv:2512.14614. 2025

2025
[30]

Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks // arXiv preprint arXiv:2503.12531

Turkcan Mehmet Kerem, Ballo Mattia, Filicori Filippo, Kostic Zoran. Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks // arXiv preprint arXiv:2503.12531. 2025

2025
[31]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan Team, Wang Ang, Ai Baole, Wen Bin, Mao Chaojie, Xie Chen-Wei, Chen Di, Yu Feiwu, Zhao Haiming, Yang Jianxiao, others. Wan: Open and advanced large-scale video generative models // arXiv preprint arXiv:2503.20314. 2025

2025
[32]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Wang Zile, Liu Zexiang, Li Jaixing, Huang Kaichen, Xu Baixin, Kang Fei, An Mengyin, Wang Peiyu, Jiang Biao, Wei Yichen, others. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory // arXiv preprint arXiv:2604.08995. 2026

2026
[33]

HunyuanVideo 1.5 Technical Report

Wu Bing, Zou Chang, Li Changlin, Huang Duojun, Yang Fang, Tan Hao, Peng Jack, Wu Jianbing, Xiong Jiangfeng, Jiang Jie, others. Hunyuanvideo 1.5 technical report // arXiv preprint arXiv:2511.18870. 2025

2025
[34]

Medical World Model // Proceedings of the IEEE/CVF International Conference on Computer Vision

Yang Yijun, Wang Zhao-Yang, Liu Qiuping, Sun Shuwen, Wang Kang, Chellappa Rama, Zhou Zongwei, Yuille Alan, Zhu Lei, Zhang Yu-Dong, others. Medical World Model // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. 8319–8329

2025
[35]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang Zhuoyi, Teng Jiayan, Zheng Wendi, Ding Ming, Huang Shiyu, Xu Jiazheng, Yang Yuanming, Hong Wenyi, Zhang Xiaohan, Feng Guanyu, others. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer // arXiv preprint arXiv:2408.06072. 2024. 11

2024
[36]

Frame context packing and drift prevention in next-frame-prediction video diffusion models // arXiv preprint arXiv:2504.12626

Zhang Lvmin, Cai Shengqu, Li Muyang, Wetzstein Gordon, Agrawala Maneesh. Frame context packing and drift prevention in next-frame-prediction video diffusion models // arXiv preprint arXiv:2504.12626. 2025

2025
[37]

The unreasonable effectiveness of deep features as a perceptual metric // Proceedings of the IEEE conference on computer vision and pattern recognition

Zhang Richard, Isola Phillip, Efros Alexei A, Shechtman Eli, Wang Oliver. The unreasonable effectiveness of deep features as a perceptual metric // Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. 586–595.

[40]

Zia Aneeq, Berniker Max, Garcia Nespolo Rogerio, Perreault Conor, Bhattacharyya Kiran, Liu Xi, Wang Ziheng, Kon...

