From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
Pith reviewed 2026-05-12 03:11 UTC · model grok-4.3
The pith
Lifting articulated kinematics into five image-aligned modalities with hierarchical routing generates more accurate action-conditioned surgical videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The method converts articulated kinematics into a unified set of five image-aligned control modalities and feeds them to a hierarchically routed visual control framework that selectively activates the relevant modalities and motion scales. Together with kinematic-prior-guided routing losses and a budgeted sparse scheme, this improves action faithfulness, visual fidelity, and cross-domain generalization in action-conditioned surgical video generation, and an efficient variant reduces latency.
What carries the argument
The kinematic-to-visual lifting paradigm combined with the hierarchically routed visual control framework, which dynamically allocates conditioning capacity across five image-aligned modalities using routing losses and sparsity.
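The routing mechanism described above can be caricatured as a top-k gated mixture over per-modality control streams. A minimal NumPy sketch, purely illustrative: the modality names, feature shapes, and gating form are assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative modality names; the paper's actual set is not enumerated here.
MODALITIES = ["joints_2d", "flow", "depth", "masks", "velocity"]

def route(features: dict, gate_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Gate five image-aligned control streams and keep only the top-k.

    features: modality name -> feature vector of dimension d
    gate_w:   (5, d) gating weights, one scoring row per modality
    """
    stack = np.stack([features[m] for m in MODALITIES])  # (5, d)
    logits = (gate_w * stack).sum(axis=1)                # one score per modality
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                             # softmax gate
    # budgeted sparsity: zero out all but the top-k pathways, then renormalize
    keep = np.argsort(weights)[-top_k:]
    sparse = np.zeros_like(weights)
    sparse[keep] = weights[keep] / weights[keep].sum()
    return sparse @ stack                                # fused control signal

rng = np.random.default_rng(0)
feats = {m: rng.normal(size=8) for m in MODALITIES}
fused = route(feats, rng.normal(size=(5, 8)))
print(fused.shape)
```

The top-k step is what "budgeted" sparsity buys: dropped pathways need not be computed at all, which is where the claimed latency reduction would come from.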
If this is right
- Generated videos more faithfully reproduce the input robot actions and motions.
- Visual quality of the videos improves compared to uniform conditioning methods.
- The model generalizes better to new surgical domains or tools.
- The efficient variant allows faster video generation without losing much accuracy.
- Routing concentrates conditioning on the relevant control experts, keeping expert utilization efficient, temporally stable, and physically meaningful.
Where Pith is reading between the lines
- This lifting and routing idea could be tested in other domains involving articulated objects, such as generating videos of human movements or industrial robots.
- Real-time applications in surgical training simulators might become feasible with the latency reductions.
- The new benchmark dataset could serve as a standard for evaluating future video generation methods in medicine.
- The hierarchical routing might inspire similar selective mechanisms in other conditional generation tasks like text-to-video.
Load-bearing premise
Articulated kinematics can be lifted into a unified set of five image-aligned control modalities that provide all necessary information for precise control over video generation.
What would settle it
A falsifying test: run the model on a new set of kinematic inputs and check whether the output frames show tool positions or movements that fail to match the intended actions, such as incorrect grasping or cutting locations. Consistent mismatches would refute the core claim.
original abstract
Action-conditioned surgical video generation is a critical yet highly challenging problem for robotic surgery. The core difficulty is that low-dimensional control vectors must precisely govern complex image-space evolution. In this work, we propose a kinematic-to-visual lifting paradigm that converts articulated kinematics into a unified set of five image-aligned control modalities. Building on this representation, we introduce a hierarchically routed visual control framework that selectively activates the most relevant control modalities and motion scales. Instead of uniformly applying all control signals, our model performs hierarchical routing to dynamically allocate conditioning capacity. We further design kinematic-prior-guided routing loss functions to ensure physically meaningful, temporally stable, and efficient expert utilization. To improve efficiency, we propose a budgeted training and inference scheme that leverages routing-induced sparsity. By selectively discarding low-significance control pathways during training and execution, our approach enables adaptive computation that is complementary to standard distillation. We additionally construct a new benchmark with curated articulated annotations, obtained through human-in-the-loop semantic labeling and differentiable pose tracking, providing realistic supervision for action-conditioned surgical video generation. Extensive experiments demonstrate that our method consistently improves action faithfulness, visual fidelity, and cross-domain generalization over diverse baselines. Moreover, our efficient variant achieves substantial reductions in latency while maintaining strong control accuracy.
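The abstract's "kinematic-to-visual lifting" is easiest to picture for its simplest ingredient: forward-kinematics joint positions projected into the image plane as an image-aligned control signal. A minimal pinhole-projection sketch; the intrinsics `K` and joint coordinates are invented for illustration, and the paper's actual lifting is richer than this:

```python
import numpy as np

def project_joints(joints_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project camera-frame 3D joint positions (N, 3) to pixel coordinates (N, 2)."""
    uvw = joints_3d @ K.T            # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

# Illustrative intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Two hypothetical instrument joints, ~10 cm in front of the camera.
joints = np.array([[0.00, 0.00, 0.10],
                   [0.02, 0.01, 0.12]])
print(project_joints(joints, K))  # first joint lands on the principal point
```

A joint on the optical axis projects exactly to the principal point (320, 240), which is a quick sanity check for any such lifting step.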
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a kinematic-to-visual lifting approach that maps articulated robot kinematics into a fixed set of five image-aligned control modalities. These modalities feed a hierarchically routed visual control framework that dynamically selects relevant control signals and motion scales via routing, augmented by kinematic-prior-guided loss terms that promote physical consistency and expert sparsity. A budgeted training/inference scheme exploits the resulting sparsity for lower latency. The authors also release a new surgical video benchmark with human-in-the-loop articulated annotations obtained via differentiable pose tracking. Experiments are reported to show gains in action faithfulness, visual fidelity, and cross-domain generalization relative to baselines, with an efficient variant preserving accuracy at reduced compute.
Significance. If the lifting step is shown to be information-preserving and the routing mechanism is validated by ablation, the framework could meaningfully improve controllable video synthesis for robotic surgery training and simulation. The new benchmark with articulated labels is a concrete, reusable contribution that addresses a data gap in the field.
major comments (2)
- [Abstract / §3] Abstract and §3 (method): the central claim rests on the assertion that articulated kinematics can be losslessly lifted into exactly five image-aligned control modalities. No enumeration of these modalities is supplied, no argument is given for completeness with respect to depth, occlusion, or non-rigid deformation, and no ablation compares performance with four versus five (or six) modalities. Without this, reported improvements in action faithfulness cannot be attributed to the proposed framework rather than to an incomplete representation.
- [§4] §4 (experiments): the quantitative tables claim consistent gains over diverse baselines, yet the manuscript provides no per-modality ablation or routing-sparsity analysis that isolates the contribution of the hierarchical routing and kinematic-prior losses. The efficiency numbers for the budgeted variant are presented without corresponding control-accuracy curves at different sparsity levels, making it impossible to verify the claimed latency-accuracy trade-off.
minor comments (2)
- [§3.1] Notation for the five modalities and the routing gates should be introduced with explicit symbols and a small diagram in §3.1 rather than left implicit.
- [§4.1] The new benchmark section should include a table listing the number of videos, average length, and annotation statistics (e.g., number of articulated joints labeled per frame).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. We appreciate the positive assessment of the benchmark contribution and the potential of the overall framework. Below we respond point-by-point to the two major comments. We will perform a major revision that incorporates additional clarifications, enumerations, and ablations as outlined.
point-by-point responses
Referee: [Abstract / §3] Abstract and §3 (method): the central claim rests on the assertion that articulated kinematics can be losslessly lifted into exactly five image-aligned control modalities. No enumeration of these modalities is supplied, no argument is given for completeness with respect to depth, occlusion, or non-rigid deformation, and no ablation compares performance with four versus five (or six) modalities. Without this, reported improvements in action faithfulness cannot be attributed to the proposed framework rather than to an incomplete representation.
Authors: We agree that the manuscript would benefit from greater explicitness here. The paper does not claim the lifting is lossless; it presents an effective, practical mapping. We will revise the abstract and add a new subsection in §3 that (i) enumerates the five modalities (projected 2D joint positions, kinematic-derived optical flow, forward-kinematics depth, arm segmentation masks, and velocity fields), (ii) provides a concise argument for their sufficiency in the rigid-tool surgical setting while acknowledging limitations for non-rigid tissue deformation and heavy occlusion, and (iii) includes a new ablation table comparing 4-, 5-, and 6-modality variants on action-faithfulness metrics. These additions will allow readers to attribute performance gains more precisely. Revision: yes.
Referee: [§4] §4 (experiments): the quantitative tables claim consistent gains over diverse baselines, yet the manuscript provides no per-modality ablation or routing-sparsity analysis that isolates the contribution of the hierarchical routing and kinematic-prior losses. The efficiency numbers for the budgeted variant are presented without corresponding control-accuracy curves at different sparsity levels, making it impossible to verify the claimed latency-accuracy trade-off.
Authors: We accept this critique and will strengthen the experimental section. The revised manuscript will add (i) a per-modality ablation that measures the incremental effect of each control signal on faithfulness and fidelity metrics, (ii) a routing-sparsity analysis reporting expert activation rates and their correlation with the kinematic-prior losses, and (iii) control-accuracy versus sparsity curves for the budgeted variant across multiple sparsity thresholds, together with the corresponding latency measurements. These results will be placed in §4 and the supplementary material. Revision: yes.
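The promised latency-accuracy sweep is mostly bookkeeping; a hedged sketch of its shape, where `fake_generate` is a stand-in for the budgeted model and its accuracy numbers are illustrative, not measured:

```python
import time

def fake_generate(top_k: int) -> float:
    """Placeholder for the budgeted generator: pretend accuracy
    degrades linearly as control pathways are dropped (illustrative only)."""
    return 1.0 - 0.05 * (5 - top_k)

def sweep(budgets=(5, 4, 3, 2, 1)):
    """Record (budget, accuracy, wall-clock latency) for each sparsity level."""
    rows = []
    for k in budgets:
        t0 = time.perf_counter()
        acc = fake_generate(k)
        latency = time.perf_counter() - t0
        rows.append({"top_k": k, "accuracy": acc, "latency_s": latency})
    return rows

for row in sweep():
    print(row)
```

Plotting accuracy against latency across budgets yields exactly the trade-off curve the referee asks for.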
Circularity Check
No circularity: new lifting paradigm and routing framework are independently proposed without reduction to inputs or self-citations.
full rationale
The abstract and described method introduce a kinematic-to-visual lifting into five modalities, hierarchical routing, kinematic-prior-guided losses, and a budgeted scheme as novel elements. These do not reduce by definition or construction to fitted parameters, prior self-citations, or renamed known results. The new benchmark is built via external human-in-the-loop labeling and tracking, supplying independent supervision. Experiments claim improvements over baselines on faithfulness and generalization without any load-bearing step that equates outputs to inputs by fiat. This is a standard non-circular proposal of a new control representation and architecture.
Axiom & Free-Parameter Ledger
free parameters (1)
- five image-aligned control modalities
axioms (1)
- domain assumption: low-dimensional control vectors can govern complex image evolution when lifted to image-aligned modalities
invented entities (2)
- hierarchically routed visual control framework (no independent evidence)
- kinematic-prior-guided routing loss functions (no independent evidence)
Reference graph
Works this paper leans on
- [1] Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-Transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492, 2025.
- [2] Nicolás Ayobi, Alejandra Pérez-Rondón, Santiago Rodríguez, and Pablo Arbeláez. MATIS: Masked-attention transformers for surgical instrument segmentation. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pages 1–5, 2023. doi: 10.1109/ISBI53787.2023.10230819.
- [3] Diego Biagini, Nassir Navab, and Azade Farshad. HieraSurg: Hierarchy-aware diffusion model for surgical video generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 310–319. Springer, 2025.
- [4] Daniel Caballero, Juan A Sánchez-Margallo, and Francisco M Sánchez-Margallo. Generative AI for synthetic surgical training videos. British Journal of Surgery, 113(3):znag017, 2026.
- [5] Xu Cao, Kaizhao Liang, Kuei-Da Liao, Tianren Gao, Wenqian Ye, Jintai Chen, Zhiguang Ding, Jianguo Cao, James M Rehg, and Jimeng Sun. Medical video generation for disease progression simulation. arXiv preprint arXiv:2411.11943, 2024.
- [6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
- [7] Qi Chen, Kai Qian, Zhan-Xuan Hu, Yong-Hang Tai, and Zheng-Tao Yu. H-RSSG: High-fidelity robotic surgical scene generation with implicit deformable neural radiance field. IEEE Transactions on Automation Science and Engineering, 23:3353–3364, 2025.
- [8] Rui Chen, Zehuan Wu, Yichen Liu, Yuxin Guo, Jingcheng Ni, Haifeng Xia, and Siyu Xia. UniMLVG: Unified framework for multi-view long video generation with comprehensive control capabilities for autonomous driving, 2025. URL https://arxiv.org/abs/2412.04842.
- [9] Tong Chen, Shuya Yang, Junyi Wang, Long Bai, Hongliang Ren, and Luping Zhou. SurgSora: Object-aware diffusion model for controllable surgical video generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 521–531. Springer, 2025.
- [10] Yueyao Chen, Zheng Han, and Qi Dou. Llama-VG: A video vision llama-based model for endoscopy video generation. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2025.
- [11] Yuwen Chen, Kunhua Zhong, Fei Wang, Hongqian Wang, and Xueliang Zhao. Surgical workflow image generation based on generative adversarial networks. In 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), pages 82–86. IEEE, 2018.
- [12] Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai, Geng Guo, Jing Zhang, Yinlu Ding, Nassir Navab, and Jiebo Luo. How far are surgeons from surgical world models? A pilot study on zero-shot surgical video generation with expert assessment. arXiv preprint arXiv:2511.01775, 2025.
- [13] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.
- [14] Joseph Cho, Samuel Schmidgall, Cyril Zakka, Mrudang Mathur, Dhamanpreet Kaur, Rohan Shad, and William Hiesinger. SurGen: Text-guided diffusion model for surgical video generation. arXiv preprint arXiv:2408.14028, 2024.
- [15] Jennifer A Eckhoff, Guy Rosman, Maria S Altieri, Stefanie Speidel, Danail Stoyanov, Mehran Anvari, Lena Maier-Hein, Keno März, Pierre Jannin, Carla Pugh, et al. SAGES consensus recommendations on surgical video data use, structure, and exploration (for research in artificial intelligence, clinical quality improvement, and surgical education). Surgical Endo..., 2023.
- [16] Haoqi Fan, Yanghao Li, Bo Xiong, Wan-Yen Lo, and Christoph Feichtenhofer. PySlowFast. https://github.com/facebookresearch/slowfast, 2020.
- [17] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633, 2024.
- [18] Junhu Fu, Ke Chen, Weidong Guo, Shuyu Liang, Jie Xu, Chen Ma, Kehao Wang, Shengli Lin, Zeju Li, Yuanyuan Wang, et al. DepthPilot: From controllability to interpretability in colonoscopy video generation. arXiv preprint arXiv:2604.26232, 2026.
- [19] Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang, Kehao Wang, Ke Chen, Shengli Lin, Pinghong Zhou, Zeju Li, et al. ColoDiff: Integrating dynamic consistency with content awareness for colonoscopy video generation. IEEE Transactions on Medical Imaging, 2026.
- [20] Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Yu Zhang, Li Yi, and Hao Zhao. PAM: A pose-appearance-motion engine for sim-to-real HOI video generation, 2026. URL https://arxiv.org/abs/2603.22193.
- [21] Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. MagicDrive: Street view generation with diverse 3D geometry control. In ICLR, 2024.
- [22] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2025.
- [23] Junhao Ge, Zuhong Liu, Longteng Fan, Yifan Jiang, Jiaqi Su, Yiming Li, Zhejun Zhang, and Siheng Chen. Unraveling the effects of synthetic data on end-to-end autonomous driving. arXiv preprint arXiv:2503.18108, 2025.
- [24] Hugo Georgenthum, Cristian Cosentino, Fabrizio Marozzo, and Pietro Liò. Enhancing surgical documentation through multimodal visual-temporal transformers and generative AI. arXiv preprint arXiv:2504.19918, 2025.
- [25] Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4D: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation. arXiv preprint arXiv:2503.15208, 2025.
- [27] Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, Yueming Jin, Ren Zhao, Baiyong Shen, and Daguang Xu. Cosmos-H-Surgical: Learning surgical robot policies from videos via world modeling, 2026. URL https://arxiv.org/abs/2512.23162.
- [28] Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025.
- [29] Yongjun Jeon, Seonmin Park, Jongmin Shin, Kanggil Park, Bogeun Kim, Namkee Oh, and Kyu-Hwan Jung. SurGen-Net: A generative approach for surgical VQA with structured text generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1292–1299, 2025.
- [30] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A large view synthesis model with minimal 3D inductive bias. arXiv preprint arXiv:2410.17242, 2024.
- [31] Juseong Jin and Chang Wook Jeong. Surgical-LLaVA: Toward surgical scenario understanding via large language and vision models. arXiv preprint arXiv:2410.09750, 2024.
- [32] Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita, and Mariko Isogawa. Disturbance-free surgical video generation from multi-camera shadowless lamps for open surgery. arXiv preprint arXiv:2512.08577, 2025.
- [33] Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra PK Poudel, and Binod Bhattarai. Surgical vision world model. In MICCAI Workshop on Data Engineering in Medical Imaging, pages 1–10. Springer, 2025.
- [34] Çağhan Köksal, Ghazal Ghazaei, Felix Holm, Azade Farshad, and Nassir Navab. SANGRIA: Surgical video scene graph optimization for surgical workflow prediction. In International Workshop on Graphs in Biomedical Image Analysis, pages 106–117. Springer, 2024.
- [35] Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025.
- [36] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [37] Meryem Mine Kurt. Modeling disease progression with diffusion-based generative models. Master's thesis, Middle East Technical University (Turkey), 2025.
- [38] Eung-Joo Lee, William Plishker, Xinyang Liu, Timothy Kane, Shuvra S Bhattacharyya, and Raj Shekhar. Segmentation of surgical instruments in laparoscopic videos: training dataset generation and deep-learning-based framework. In Medical Imaging 2019: Image-Guided Procedures, Robotic Interventions, and Modeling, volume 10951, pages 461–469. SPIE, 2019.
- [39] Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. UniScene: Unified occupancy-centric driving scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11971–11981, 2025.
- [40] Chenxin Li, Brandon Y Feng, Yifan Liu, Hengyu Liu, Cheng Wang, Weihao Yu, and Yixuan Yuan. EndoSparse: Real-time sparse view synthesis of endoscopic scenes using Gaussian splatting. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 252–262. Springer, 2024.
- [41] Chenxin Li, Hengyu Liu, Yifan Liu, Brandon Y Feng, Wuyang Li, Xinyu Liu, Zhen Chen, Jing Shao, and Yixuan Yuan. Endora: Video generation models as endoscopy simulators. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 230–240. Springer, 2024.
- [42] Jiajie Li, Garrett Skinner, Gene Yang, Brian R Quaranto, Steven D Schwaitzberg, Peter CW Kim, and Jinjun Xiong. LLaVA-Surg: Towards multimodal surgical assistant via structured surgical video learning. arXiv preprint arXiv:2408.07981, 2024.
- [43] Linyuan Li, Jianing Qiu, Anujit Saha, Lin Li, Poyuan Li, Mengxian He, Ziyu Guo, and Wu Yuan. Artificial intelligence for biomedical video generation. arXiv preprint arXiv:2411.07619, 2024.
- [44] Samuel Li, Pujith Kachana, Prajwal Chidananda, Saurabh Nair, Yasutaka Furukawa, and Matthew Brown. Rig3R: Rig-aware conditioning for learned 3D reconstruction. arXiv preprint arXiv:2506.02265, 2025.
- [45] Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijing Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, and Junjun He. Ophora: A large-scale data-driven text-guided ophthalmic surgical video generation model. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 425–435. Springer, 2025.
- [46] Yaoqian Li, Xikai Yang, Dunyuan Xu, Yang Yu, Litao Zhao, Xiaowei Hu, Jinpeng Li, and Pheng-Ann Heng. SurgPub-Video: A comprehensive surgical video framework for enhanced surgical intelligence in vision-language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6628–6635, 2026.
- [47] Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, et al. WorldLens: Full-spectrum evaluations of driving world models in real world. arXiv preprint arXiv:2512.10958, 2025.
- [48] Ruofan Liang, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, et al. DiffusionRenderer: Neural inverse and forward rendering with video diffusion models. arXiv preprint arXiv:2501.18590, 2025.
- [49] Chih-Hao Lin, Zian Wang, Ruofan Liang, Yuxuan Zhang, Sanja Fidler, Shenlong Wang, and Zan Gojcic. Controllable weather synthesis and removal with video diffusion models. arXiv preprint arXiv:2505.00704, 2025.
- [50] Han Lin, Jaemin Cho, Abhay Zala, and Mohit Bansal. Ctrl-Adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model, 2024.
- [51] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [52] Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, and Zhizhong Su. RoboTransfer: Geometry-consistent video diffusion for robotic visual policy transfer. arXiv preprint arXiv:2505.23171, 2025.
- [53] Xinyu Liu, Hengyu Liu, Cheng Wang, Tianming Liu, and Yixuan Yuan. EndoGen: Conditional autoregressive endoscopic video generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 169–179. Springer, 2025.
- [54] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin Transformer. arXiv preprint arXiv:2106.13230, 2021.
- [55] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [56] Constantinos Loukas. Video content analysis of surgical procedures. Surgical Endoscopy, 32(2):553–568, 2018.
- [57] Ning Ma, Shu Yang, Yizhao Zhou, Chaoyang Zhang, Jian Chen, and Xiaoman He. Open-world surgical video generation via dual-visual diffusion and dual-annealed generation. Neural Networks, page 108281, 2025.
- [58] Lennart Maack and Alexander Schlaefer. An approach to enriching surgical video datasets for fine-grained spatial-temporal understanding of vision-language models. arXiv preprint arXiv:2604.00784, 2026.
- [59] Sabina Martyniak, Joanna Kaleta, Diego Dall'Alba, Michał Naskręt, Szymon Płotka, and Przemysław Korzeniowski. SimuScope: Realistic endoscopic synthetic dataset generation through surgical simulation and diffusion models. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4268–4278, 2024. URL https://api.semanticscholar.org/C...
- [60] Sabina Martyniak, Joanna Kaleta, Diego Dall'Alba, Michał Naskręt, Szymon Płotka, and Przemysław Korzeniowski. SimuScope: Realistic endoscopic synthetic dataset generation through surgical simulation and diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4268–4278. IEEE, 2025.
- [61] Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: A literature survey. Artificial Intelligence Review, 42(2):275–293, 2014.
- [62] Zhe Min, Jiewen Lai, and Hongliang Ren. Innovating robot-assisted surgery through large vision models. Nature Reviews Electrical Engineering, 2(5):350–363, 2025.
- [63] Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, et al. (Open-H-Embodiment Consortium). Open-H-Embodiment: A large-scale dataset for enabling foundation models in medical robotics, 2026. URL https://api.semanticscholar.org/CorpusID:287702178.
- [64] Takuya Ozawa, Yuichiro Hayashi, Hirohisa Oda, Masahiro Oda, Takayuki Kitasaka, Nobuyoshi Takeshita, Masaaki Ito, and Kensaku Mori. Synthetic laparoscopic video generation for machine learning-based surgical instrument segmentation from real laparoscopic video and virtual surgical instruments. Computer Methods in Biomechanics and Biomedical Engineering: Ima..., 2021.
- [65] Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, et al. SAR-RARP50: Segmentation of surgical instrumentation and action recognition on robot-assisted radical prostatectomy challenge. arXiv preprint arXiv:2401.00496, 2023.
- [66] Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Pengfei Guo, Daguang Xu, and Mathias Unberath. SAW: Toward a surgical action world model via controllable and scalable video generation, 2026. URL https://arxiv.org/abs/2603.13024.
- [67] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [68] Carol E Reiley, Erion Plaku, and Gregory D Hager. Motion generation of robotic surgical tasks: Learning from expert demonstrations. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pages 967–970. IEEE, 2010.
- [69] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks, 2024.
- [70] Ariel Rodriguez, Chenpan Li, Lorenzo Mazza, Rayan Younis, Ortrun Hellig, Sebastian Bodenstedt, Martin Wagner, and Stefanie Speidel. LAR-MoE: Latent-aligned routing for mixture of experts in robotic imitation learning. arXiv preprint arXiv:2603.08476, 2026.
- [71] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- [72] Sharona B Ross, Aryan Modasi, Maria Christodoulou, Iswanto Sucandy, Mehran Anvari, Thom E Lobe, Elan Witkowski, and Richard Satava. New generation evaluations: video-based surgical assessments: a technology update. Surgical Endoscopy, 37(10):7401–7411, 2023.
- [73] Usha Ruby, Vamsidhar Yendapalli, et al. Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng., 9(10), 2020.
- [74] Yann Sakref, Lalithkumar Seenivasan, Hao Ding, Ruhika Iyer, Danush Kumar Venkatesh, Stefanie Speidel, Mathias Unberath, Jeffrey K Jopling, and Lisa Marie Knowlton. Empowering surgeons with integrated synthetic data: solutions for mastering complex clinical scenarios. npj Digital Medicine, 2026.
- [75] Saumya Saksena. R3D-18 for UCF-101 action recognition. https://huggingface.co/dronefreak/r3d-18-ucf101, 2024.
- [76] Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, and Anirban Mukhopadhyay. SG2VID: Scene graphs enable fine-grained control for video synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 511–
- [77] Weixiang Sun, Xiaocao You, Ruizhe Zheng, Zhengqing Yuan, Xiang Li, Lifang He, Quanzheng Li, and Lichao Sun. Bora: Biomedical generalist video generation model. arXiv preprint arXiv:2407.08944, 2024.
- [78] Mohamad-Hani Temsah, Rakan Nazer, Ibraheem Altamimi, Raniah Aldekhyyel, Amr Jamal, Mohammad Almansour, Fadi Aljamaan, Khalid Alhasan, Abdulkarim A Temsah, Ayman Al-Eyadhy, et al. OpenAI's Sora and Google's Veo 2 in action: a narrative review of artificial intelligence-driven video generation models transforming healthcare. Cureus, 17(1):e77593, 2025.
- [79] Mehmet Kerem Turkcan, Mattia Ballo, Filippo Filicori, and Zoran Kostic. Towards suturing world models: Learning predictive models for robotic surgical tasks. arXiv preprint arXiv:2503.12531, 2025.
- [80] Natalia Valderrama, Paola Ruiz, Isabela Hernández, Nicolás Ayobi, Mathilde Verlyck, Jessica Santander, Juan Caicedo, Nicolás Fernández, and Pablo Arbeláez. Towards holistic surgical scene understanding. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 442–452. Springer Nature Switzerland, 2022.
- [81] Danush Kumar Venkatesh, Adam Schmidt, Muhammad Abdullah Jamal, and Omid Mohareri. Mitigating surgical data imbalance with dual-prediction video diffusion model. arXiv preprint arXiv:2510.07345, 2025.