Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
Pith reviewed 2026-05-10 05:46 UTC · model grok-4.3
The pith
LLM-generated narratives from surgical videos support scalable vision-language pre-training when paired with confidence-weighted alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurgLIME learns reliable cross-modal alignments from noisy LLM-generated surgical narratives by using a LoRA-adapted dual-encoder architecture and an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment, achieving competitive zero-shot performance on the AutoLaparo and Cholec80 benchmarks while preserving the linear probing performance of the visual foundation model.
What carries the argument
SurgLIME framework: a parameter-efficient dual-encoder that applies LoRA adaptation to the visual backbone and uses an automated confidence score to re-weight each text sample inside the contrastive loss.
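The confidence-weighted contrastive objective is the core mechanism here. Below is a minimal PyTorch sketch of how a per-narrative confidence score could enter a CLIP-style symmetric InfoNCE loss; the function name, temperature value, and weight normalization are assumptions, since the paper's exact formulation is not given in the abstract.

```python
# Minimal sketch of a confidence-weighted, CLIP-style contrastive loss.
# The weighting and normalization choices are assumptions, not the paper's
# exact formulation.
import torch
import torch.nn.functional as F

def weighted_clip_loss(video_emb, text_emb, confidence, temperature=0.07):
    """video_emb, text_emb: (B, D) L2-normalized embeddings.
    confidence: (B,) scores in [0, 1] for each generated narrative."""
    logits = video_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    # Per-sample InfoNCE in both directions (video->text and text->video).
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.t(), targets, reduction="none")

    # Down-weight pairs whose narrative the confidence estimator distrusts.
    weights = confidence / confidence.sum().clamp_min(1e-8)
    return (weights * (loss_v2t + loss_t2v) / 2).sum()

# Toy usage with random embeddings.
B, D = 8, 512
video = F.normalize(torch.randn(B, D), dim=-1)
text = F.normalize(torch.randn(B, D), dim=-1)
conf = torch.rand(B)
print(weighted_clip_loss(video, text, conf))
```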
If this is right
- Surgical vision-language models can be pre-trained at much larger scale than before because text labels no longer require expert time.
- Existing visual foundation models for surgery can be extended to text-aware tasks without retraining the entire visual encoder from scratch; a minimal LoRA sketch follows this list.
- Zero-shot retrieval and reasoning between surgical video and language become practical on standard benchmarks.
- The same confidence-weighting pattern may allow other medical domains to use automatically generated text for multi-modal training.
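To make the "no retraining from scratch" point concrete, here is an illustrative sketch of wrapping a pretrained ViT backbone with LoRA adapters via the peft library. The backbone choice, the target module name ("qkv", the attention projection in timm ViT blocks), and the rank settings are illustrative assumptions, not SurgLIME's actual configuration.

```python
# Illustrative sketch: freeze a pretrained visual encoder and train only
# low-rank LoRA adapters. Backbone, target modules, and ranks are assumptions,
# not the paper's configuration.
import timm
from peft import LoraConfig, get_peft_model

backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

lora_cfg = LoraConfig(
    r=8,                     # low-rank dimension of the adapter matrices
    lora_alpha=16,           # scaling factor applied to the adapter output
    lora_dropout=0.1,
    target_modules=["qkv"],  # attention projections in timm ViT blocks
)
backbone = get_peft_model(backbone, lora_cfg)
backbone.print_trainable_parameters()  # only the LoRA matrices require gradients
```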
Where Pith is reading between the lines
- If the confidence mechanism generalizes, similar pipelines could be applied to other video domains that already have strong visual encoders but lack paired text.
- The approach opens the possibility of continuously updating surgical models from new operating-room footage without repeated expert annotation campaigns.
- A natural next test is whether the same framework improves performance on downstream surgical tasks such as phase recognition or tool detection when text supervision is added.
Load-bearing premise
The automated confidence scores can correctly identify and reduce the influence of hallucinated or erroneous LLM text without discarding useful information or introducing new biases into the alignment.
What would settle it
A verification set of human-checked surgical narratives where the model's confidence scores show no correlation with actual text accuracy, or where training on the full noisy set produces worse zero-shot alignment than training on a small clean subset.
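One way to run that check, sketched below, is to correlate the model's per-narrative confidence scores with human verdicts on a verified subset. The variable names, the synthetic placeholder data, and the choice of Spearman correlation are assumptions for illustration, not a procedure described in the paper.

```python
# Sketch of the verification test: do confidence scores track human-checked
# narrative accuracy? Placeholder data stands in for a human-verified subset.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
confidence = rng.uniform(size=200)                                   # placeholder model scores
human_accuracy = (rng.uniform(size=200) < confidence).astype(float)  # placeholder verdicts

rho, p_value = spearmanr(confidence, human_accuracy)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
# A rho near zero would undercut the load-bearing premise; a clearly positive
# rho on real verified narratives would support it.
```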
Original abstract
Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could potentially lead to catastrophically degraded pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments using noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at https://github.com/visurg-ai/SurgLIME.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIME, a large-scale multi-modal dataset of LLM-generated narratives from open-access surgical videos, and SurgLIME, a parameter-efficient VLP framework using a LoRA-adapted dual-encoder architecture plus an automated confidence estimation mechanism to dynamically down-weight uncertain text during contrastive alignment. The central claim is that this enables reliable cross-modal learning from noisy LLM text, achieving competitive zero-shot alignment on AutoLaparo and Cholec80 benchmarks while preserving the linear probing performance of the base visual foundation model. Dataset, code, and models are released publicly.
Significance. If the results hold, the work has substantial significance for surgical computer vision by addressing the expert-annotation bottleneck via scalable LLM-generated text with explicit noise mitigation. The emphasis on preserving medical visual priors is a key strength for downstream clinical applications. Public release of data, code, and models is a clear positive for reproducibility and community follow-up.
major comments (2)
- §3.2 (Automated Confidence Estimation): The mechanism for estimating and down-weighting text uncertainty is load-bearing for the central claim of reliable alignment without degrading visual priors, yet the manuscript provides no quantitative validation of its hallucination-detection accuracy, no ablation comparing weighted vs. unweighted contrastive loss, and no analysis of potential new biases introduced by the weighting. This leaves the mitigation strategy under-supported relative to the abstract's assertions.
- §5 (Experiments on AutoLaparo and Cholec80): The claim of 'competitive zero-shot cross-modal alignment' and preserved linear probing performance requires explicit numerical results, baseline comparisons (including prior surgical VLP methods), and ablations on text-error rates; the current presentation does not include these details, making it difficult to assess whether the outcome is robust or merely consistent with the visual encoder alone.
minor comments (2)
- Abstract: The word 'competitive' is imprecise; adding one or two concrete metrics (e.g., recall@5 or zero-shot accuracy deltas) would improve clarity without altering the narrative.
- §3 (Notation and figures): Ensure all symbols in the contrastive loss formulation (e.g., temperature, weighting function) are defined at first use and that confidence-estimation diagrams include axis labels and legend entries for reproducibility.
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and will revise the paper accordingly to address the concerns raised. Below, we provide point-by-point responses.
Point-by-point responses
Referee: §3.2 (Automated Confidence Estimation): The mechanism for estimating and down-weighting text uncertainty is load-bearing for the central claim of reliable alignment without degrading visual priors, yet the manuscript provides no quantitative validation of its hallucination-detection accuracy, no ablation comparing weighted vs. unweighted contrastive loss, and no analysis of potential new biases introduced by the weighting. This leaves the mitigation strategy under-supported relative to the abstract's assertions.
Authors: We agree that the automated confidence estimation mechanism requires stronger empirical support to substantiate its effectiveness in mitigating noise from LLM-generated text. The original submission focused on describing the method and its integration into the SurgLIME framework but omitted detailed quantitative evaluations. In the revised manuscript, we will include: (1) quantitative validation of the hallucination-detection accuracy using a subset of manually verified texts, reporting precision, recall, and F1 scores; (2) an ablation study directly comparing the contrastive loss with and without the dynamic weighting; and (3) an analysis of potential biases by evaluating performance stratified by surgical procedure and text length. These additions will provide stronger empirical support for the noise-mitigation mechanism.
Revision: yes
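A hedged sketch of item (1): treat low confidence as a hallucination flag and score it against human labels on the verified subset. The threshold, the label convention, and the placeholder data are assumptions for illustration, not the authors' protocol.

```python
# Sketch of hallucination-detection scoring: flag narratives with low
# confidence and compare the flags with human labels on a verified subset.
# Threshold, label convention, and placeholder data are assumptions.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
confidence = rng.uniform(size=500)                     # placeholder confidence scores
is_hallucination = rng.uniform(size=500) > confidence  # placeholder human labels

threshold = 0.5
flagged = confidence < threshold                       # low confidence -> flagged

precision, recall, f1, _ = precision_recall_fscore_support(
    is_hallucination, flagged, average="binary"
)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```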
Referee: §5 (Experiments on AutoLaparo and Cholec80): The claim of 'competitive zero-shot cross-modal alignment' and preserved linear probing performance requires explicit numerical results, baseline comparisons (including prior surgical VLP methods), and ablations on text-error rates; the current presentation does not include these details, making it difficult to assess whether the outcome is robust or merely consistent with the visual encoder alone.
Authors: We acknowledge that the experimental section would benefit from more explicit and comprehensive reporting. While the manuscript states that SurgLIME achieves competitive zero-shot alignment and preserves linear probing performance, specific numerical values, comparisons, and ablations were not fully detailed. In the revision, we will expand §5 to include: tables with exact metrics such as zero-shot recall@1, recall@5, and accuracy on both AutoLaparo and Cholec80; comparisons to relevant baselines including prior surgical vision-language pre-training approaches where applicable, as well as the base visual encoder without language alignment; and ablations that systematically vary text-error rates (e.g., by simulating different levels of noise in the narratives) to demonstrate the robustness of the confidence weighting. This will clarify that the results are due to the proposed framework rather than the visual model alone.
Revision: yes
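A minimal sketch of the proposed text-error-rate ablation: corrupt a controlled fraction of narratives by swapping tokens, then retrain and re-evaluate at each error rate. The corruption scheme and the example narrative are assumptions; the training and zero-shot evaluation steps are only indicated in a comment.

```python
# Sketch of the text-error-rate ablation: corrupt a fraction of narratives,
# then retrain and re-evaluate at each error rate (training omitted here).
import random

def corrupt(narrative: str, vocab: list[str], n_swaps: int = 3) -> str:
    """Replace a few tokens with random vocabulary words to mimic LLM errors."""
    tokens = narrative.split()
    for idx in random.sample(range(len(tokens)), k=min(n_swaps, len(tokens))):
        tokens[idx] = random.choice(vocab)
    return " ".join(tokens)

narratives = ["the grasper retracts the gallbladder while the hook dissects the cystic duct"]
vocab = ["scissors", "suction", "clip applier", "liver"]

for error_rate in (0.0, 0.1, 0.25, 0.5):
    noisy = [corrupt(t, vocab) if random.random() < error_rate else t for t in narratives]
    # Train SurgLIME on `noisy` and record zero-shot recall here.
    print(f"error rate {error_rate}: {noisy[0]}")
```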
Circularity Check
No significant circularity: empirical VLP framework
full rationale
The paper describes an empirical pipeline: LLM-generated narratives form the LIME dataset, SurgLIME applies LoRA-adapted dual encoders plus automated confidence weighting inside a standard contrastive loss, and benchmark numbers are reported on AutoLaparo and Cholec80. No equations, uniqueness theorems, or first-principles derivations appear in the provided text; performance claims rest on external public benchmarks and public artifacts rather than quantities defined solely by the paper's own fitted parameters or self-referential definitions. The claims therefore do not rest on a circular derivation chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence weighting threshold
axioms (2)
- domain assumption: Contrastive alignment remains effective when uncertain or noisy text is dynamically down-weighted
- domain assumption: LoRA adaptation of the visual encoder preserves foundational medical priors learned from self-supervised pre-training