Recognition: 2 theorem links · Lean Theorem
SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
Pith reviewed 2026-05-16 07:13 UTC · model grok-4.3
The pith
SurgMotion learns surgical video understanding by predicting latent motion rather than reconstructing pixels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurgMotion is a video-native foundation model built on V-JEPA that replaces pixel-level reconstruction with latent motion prediction. It introduces motion-guided latent masked prediction to focus on meaningful regions, spatiotemporal affinity self-distillation to maintain relational consistency, and spatiotemporal feature diversity regularization to avoid collapse in texture-sparse scenes. Pretrained on the 3,658-hour SurgMotion-15M dataset spanning 13 anatomical regions, the model outperforms prior methods, with a 14.6 percent F1 improvement on EgoSurgery workflow recognition, a 10.3 percent improvement on PitVis, and 39.54 percent mAP-IVT on CholecT50 action triplet recognition.
What carries the argument
Motion-guided latent masked prediction that directs learning toward semantically meaningful regions instead of low-level visual noise.
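The review does not spell out the masking rule, but a minimal sketch helps fix the idea: estimate per-patch motion from temporal frame differences and route the latent predictor toward the highest-motion patches. The code below is an illustrative reconstruction, not the authors' implementation; the name `motion_guided_mask`, the average-pooling choice, and the default masking ratio are assumptions.

```python
# Minimal sketch of motion-guided patch selection, assuming motion is
# approximated by temporal frame differences pooled over the patch grid.
# Function and parameter names are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def motion_guided_mask(video, patch=16, mask_ratio=0.75):
    """video: (T, C, H, W) clip. Returns indices of the highest-motion
    patches per frame pair, to be used as latent prediction targets."""
    # Per-pixel motion magnitude between consecutive frames.
    diff = (video[1:] - video[:-1]).abs().mean(dim=1, keepdim=True)      # (T-1, 1, H, W)
    # Pool motion magnitude over each non-overlapping patch.
    patch_motion = F.avg_pool2d(diff, kernel_size=patch, stride=patch)   # (T-1, 1, H/p, W/p)
    scores = patch_motion.flatten(1)                                     # (T-1, N_patches)
    # Mask the highest-motion patches so the predictor must infer them in latent space.
    n_mask = int(mask_ratio * scores.shape[1])
    return scores.topk(n_mask, dim=1).indices                            # (T-1, n_mask)

# Example: a random 8-frame, 224x224 RGB clip.
clip = torch.rand(8, 3, 224, 224)
print(motion_guided_mask(clip).shape)  # torch.Size([7, 147]) with the defaults above
```

Under the V-JEPA paradigm, the context encoder would then see the unmasked patches while the predictor is trained to match the target encoder's latents at the masked indices.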
If this is right
- Higher accuracy on surgical workflow recognition without task-specific fine-tuning.
- Stronger recognition of action triplets that describe tool-tissue interactions.
- Improved performance on skill assessment and visual tasks such as polyp segmentation and depth estimation.
- A scalable pretraining recipe that works across 50 video sources and 13 anatomical regions.
Where Pith is reading between the lines
- The same motion-focused objective could transfer to other high-noise video domains such as underwater or endoscopic imaging.
- Large curated surgical datasets may become standard benchmarks for testing motion-centric video models.
- Reducing emphasis on pixel reconstruction may lower the data and compute needed to reach usable representations.
- Representations built this way could support real-time surgical assistance systems that react to procedure semantics.
Load-bearing premise
Prioritizing latent motion prediction will capture semantically meaningful structures without discarding low-level cues needed for some downstream tasks.
What would settle it
A pixel-reconstruction model trained on the identical SurgMotion-15M dataset that matches or exceeds SurgMotion's scores on the 17 benchmarks would show the motion-prediction shift is not required.
Original abstract
While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details, such as smoke, specular reflections, and fluid motion, rather than semantic structures essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion introduces three key technical innovations tailored to surgical videos: (1) motion-guided latent masked prediction to prioritize semantically meaningful regions, (2) spatiotemporal affinity self-distillation to enforce relational consistency, and (3) spatiotemporal feature diversity regularization (SFDR) to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate SurgMotion-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that SurgMotion significantly outperforms state-of-the-art methods on surgical workflow recognition, achieving 14.6 percent improvement in F1 score on EgoSurgery and 10.3 percent on PitVis; on action triplet recognition with 39.54 percent mAP-IVT on CholecT50; as well as on skill assessment, polyp segmentation, and depth estimation. These results establish SurgMotion as a new standard for universal, motion-oriented surgical video understanding.
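The abstract names SFDR only at a high level. As a hedged illustration of what a spatiotemporal feature diversity regularizer can look like, the sketch below applies a variance-and-covariance penalty over encoder tokens in the style of VICReg; the name `sfdr_loss` and the exact terms are assumptions, not the paper's formulation.

```python
# Hedged sketch of a feature diversity regularizer in the spirit of SFDR.
# It penalizes low per-dimension variance and high cross-dimension covariance,
# which discourages representation collapse in texture-sparse scenes.
import torch

def sfdr_loss(tokens, eps=1e-4):
    """tokens: (B, N, D) spatiotemporal features from the encoder."""
    B, N, D = tokens.shape
    z = tokens.reshape(B * N, D)
    z = z - z.mean(dim=0)
    # Variance term: push each feature dimension's std toward a unit target.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_term = torch.relu(1.0 - std).mean()
    # Covariance term: decorrelate feature dimensions.
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_term = (off_diag ** 2).sum() / D
    return var_term + cov_term

print(sfdr_loss(torch.randn(2, 1568, 768)))  # small on random (non-collapsed) features
```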
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SurgMotion, a video-native foundation model for surgical video understanding built on V-JEPA. It replaces pixel-level reconstruction with latent motion prediction and introduces three surgical-specific components: motion-guided latent masked prediction, spatiotemporal affinity self-distillation, and spatiotemporal feature diversity regularization (SFDR). The authors curate SurgMotion-15M, a new 3,658-hour dataset from 50 sources, and report large gains over prior methods on 17 benchmarks, including +14.6% F1 on EgoSurgery workflow recognition, +10.3% on PitVis, and 39.54% mAP-IVT on CholecT50 action triplet recognition, plus gains on skill assessment, segmentation, and depth estimation.
Significance. If the attribution of gains to the proposed motion-oriented objectives and components is substantiated, the work would provide a new pretraining paradigm and the largest public surgical video corpus to date, with potential to improve downstream performance across workflow, action, and skill tasks in computer-assisted surgery.
major comments (1)
- [Abstract and §4] Abstract and §4 (Experiments): the headline improvements (14.6% F1 on EgoSurgery, 10.3% on PitVis, 39.54% mAP-IVT on CholecT50) are credited to the shift from pixel reconstruction plus the three innovations, yet no control experiment is reported that trains an unmodified V-JEPA baseline on the identical SurgMotion-15M corpus; without this isolation the gains cannot be unambiguously attributed to the methodological changes rather than the 3,658-hour scale and diversity of the new data.
minor comments (2)
- [Abstract] The abstract states specific percentage gains but supplies no error bars, number of runs, or statistical significance tests; these details should be added to the main experimental tables.
- [§3.3] Notation for the SFDR coefficient and motion threshold is introduced without an explicit hyper-parameter table or sensitivity analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that isolating the contributions of our proposed components from the scale of SurgMotion-15M is important and will add the requested control experiment in the revision.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline improvements (14.6% F1 on EgoSurgery, 10.3% on PitVis, 39.54% mAP-IVT on CholecT50) are credited to the shift from pixel reconstruction plus the three innovations, yet no control experiment is reported that trains an unmodified V-JEPA baseline on the identical SurgMotion-15M corpus; without this isolation the gains cannot be unambiguously attributed to the methodological changes rather than the 3,658-hour scale and diversity of the new data.
Authors: We acknowledge the validity of this concern. The current manuscript does not include a direct ablation of unmodified V-JEPA trained on the full SurgMotion-15M corpus, which limits the strength of attribution to the motion-guided masking, affinity distillation, and SFDR components. In the revised version we will add this baseline: we will train V-JEPA from scratch on SurgMotion-15M using its original pixel-reconstruction objective and report its performance on all 17 downstream benchmarks alongside the SurgMotion results. This control will be presented in a new table in §4 and referenced in the abstract and discussion, allowing readers to quantify the incremental benefit of the surgical-specific objectives over dataset scale alone.
Revision: yes
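A minimal sketch of the promised control, assuming a shared frozen-feature linear-probe protocol so that any gap between checkpoints reflects the pretraining objective rather than the evaluation head. The helpers (`linear_probe`, `compare`, `extract_features`) and the use of scikit-learn are illustrative, not the paper's evaluation code.

```python
# Sketch: evaluate a vanilla V-JEPA checkpoint and SurgMotion under one
# frozen-feature protocol across downstream benchmarks.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe(train_feats, train_y, test_feats, test_y):
    """Fit a linear classifier on frozen features; report macro F1."""
    clf = LogisticRegression(max_iter=2000).fit(train_feats, train_y)
    return f1_score(test_y, clf.predict(test_feats), average="macro")

def compare(encoders, benchmarks, extract_features):
    """encoders: {model_name: frozen encoder}. benchmarks: {name: (train_clips,
    ytr, test_clips, yte)}. extract_features(encoder, clips) -> (n, d) array is
    a hypothetical helper. Returns a {benchmark: {model: F1}} table."""
    table = {}
    for bench, (train_clips, ytr, test_clips, yte) in benchmarks.items():
        table[bench] = {}
        for name, enc in encoders.items():
            Xtr = extract_features(enc, train_clips)
            Xte = extract_features(enc, test_clips)
            table[bench][name] = linear_probe(Xtr, ytr, Xte, yte)
    return table
```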
Circularity Check
No circularity: claims rest on external empirical benchmarks
Full rationale
The paper introduces SurgMotion by extending the external V-JEPA architecture with three described components (motion-guided masked prediction, spatiotemporal affinity self-distillation, SFDR) and a new curated dataset SurgMotion-15M. All headline performance numbers (14.6% F1 on EgoSurgery, 10.3% on PitVis, 39.54% mAP-IVT on CholecT50) are presented as direct comparisons against prior published state-of-the-art methods on public benchmarks. No equations, self-definitions, fitted-parameter renamings, or self-citation chains are supplied that would reduce any claimed prediction or uniqueness result to the paper's own inputs by construction. The derivation chain is therefore self-contained against external references.
Axiom & Free-Parameter Ledger
free parameters (2)
- masking ratio and motion threshold
- SFDR strength coefficient
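The free parameters above (and the sensitivity analysis requested in the minor comments) could be organized as a small configuration grid. The sketch below is illustrative; the names and values are placeholders, not the paper's settings.

```python
# Hypothetical collection and sweep of the ledger's free parameters.
from dataclasses import dataclass
from itertools import product

@dataclass
class PretrainConfig:
    mask_ratio: float        # fraction of patches selected as prediction targets
    motion_threshold: float  # minimum pooled motion for a patch to count as "moving"
    sfdr_coeff: float        # weight of the feature diversity regularizer

grid = [PretrainConfig(m, t, s)
        for m, t, s in product([0.6, 0.75, 0.9], [0.01, 0.05], [0.1, 1.0])]
print(len(grid), "configurations to sweep")  # 12
```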
axioms (2)
- domain assumption: Latent motion prediction captures semantic surgical structures better than pixel reconstruction
- standard math: V-JEPA joint embedding architecture transfers to surgical video without major modification
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, Aczél classification) · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "shifts the learning paradigm from pixel-level reconstruction to latent motion prediction... motion-guided latent masked prediction... spatiotemporal affinity self-distillation... spatiotemporal feature diversity regularization (SFDR)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Built on the Video Joint Embedding Predictive Architecture (V-JEPA)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
- [2] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
- [3] C. Feichtenhofer, Y. Li, K. He et al., "Masked autoencoders as spatiotemporal learners," Advances in Neural Information Processing Systems, vol. 35, pp. 35946–35958, 2022.
- [4] D. Batić, F. Holm, E. Özsoy, T. Czempiel, and N. Navab, "EndoViT: pretraining vision transformers on a large collection of endoscopic images," International Journal of Computer Assisted Radiology and Surgery, vol. 19, no. 6, pp. 1085–1091, 2024.
- [5] Z. Wang, C. Liu, S. Zhang, and Q. Dou, "Foundation model for endoscopy video analysis via large-scale self-supervised pre-train," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 101–111.
- [6] S. Schmidgall, J. W. Kim, J. Jopling, and A. Krieger, "General surgery vision transformer: A video pre-trained foundation model for general surgery," arXiv preprint arXiv:2403.05949, 2024.
- [7] Z. Tong, Y. Song, J. Wang, and L. Wang, "VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 10078–10093.
- [8] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, "VideoMAE V2: Scaling video masked autoencoders with dual masking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023, pp. 14549–14560.
- [9] S. Ramesh, V. Srivastav, D. Alapatt, T. Yu, A. Murali, L. Sestini, C. I. Nwoye, I. Hamoud, S. Sharma, A. Fleurentin et al., "Dissecting self-supervised learning methods for surgical computer vision," Medical Image Analysis, vol. 88, p. 102844, 2023.
- [10] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy, "EndoNet: a deep architecture for recognition tasks on laparoscopic videos," IEEE Transactions on Medical Imaging, vol. 36, no. 1, pp. 86–97, 2016.
- [11] A. Das, D. Z. Khan, D. Psychogyios, Y. Zhang, J. G. Hanrahan, F. Vasconcelos, Y. Pang, Z. Chen, J. Wu, X. Zou et al., "PitVis-2023 challenge: Workflow recognition in videos of endoscopic pituitary surgery," Medical Image Analysis, p. 103716, 2025.
- [12] R. Fujii, M. Hatano, H. Saito, and H. Kajita, "EgoSurgery-Phase: a dataset of surgical phase recognition from egocentric open surgery videos," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 187–196.
- [13] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas, "Revisiting feature prediction for learning visual representations from video," arXiv preprint arXiv:2404.08471, 2024.
- [14] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus et al., "V-JEPA 2: Self-supervised video models enable understanding, prediction and planning," arXiv preprint arXiv:2506.09985, 2025.
- [15] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., "Bootstrap your own latent: A new approach to self-supervised learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 21271–21284.
- [16] Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, J. Xu, Z. Wang, Y. Shi, T. Jiang, S. Li, H. Zhang, Y. Huang, Y. Qiao, Y. Wang, and L. Wang, "InternVideo2: Scaling video foundation models for multimodal video understanding," in European Conference on Computer Vision (ECCV), 2024.
- [17] C. Wang, K. Li, Y. He, Y. Wang, Z. Yan, J. Yu, Y. Wang, and L. Wang, "InternVideo-Next: Towards general video foundation models without video-text supervision," arXiv preprint arXiv:2512.01342, 2025.
- [18] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
- [19] M. Seitzer et al., "DINOv3," arXiv preprint arXiv:2508.10104, 2025.
- [20] M. R. Jong, T. G. Boers, K. N. Fockens, J. B. Jukema, C. H. Kusters, T. J. Jaspers, R. v. E. van Heslinga, F. C. Slooter, M. R. Struyvenberg, R. Bisschops et al., "GastroNet-5M: A multicenter dataset for developing foundation models in gastrointestinal endoscopy," Gastroenterology, 2025.
- [21] R. Hirsch, M. Caron, R. Cohen, A. Livne, R. Shapiro, T. Golany, R. Goldenberg, D. Freedman, and E. Rivlin, "Self-supervised learning for endoscopic video analysis," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 569–578.
- [22] Q. Tian, H. Liao, X. Huang, B. Yang, D. Lei, S. Ourselin, and H. Liu, "EndoMamba: an efficient foundation model for endoscopic videos via hierarchical pre-training," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2025, pp. 224–234.
- [23] T. J. Jaspers, R. L. de Jong, Y. Li, C. H. Kusters, F. H. Bakker, R. C. van Jaarsveld, G. M. Kuiper, R. van Hillegersberg, J. P. Ruurda, W. M. Brinkman et al., "Scaling up self-supervised learning for improved surgical foundation models," arXiv preprint arXiv:2501.09436, 2025.
- [24] K. Yuan, V. Srivastav, T. Yu, J. L. Lavanchy, J. Marescaux, P. Mascagni, N. Navab, and N. Padoy, "Learning multi-modal representations by watching hundreds of surgical video lectures," Medical Image Analysis, p. 103644, 2025.
- [25] R. Stauder, D. Ostler, M. Kranzfelder, S. Koller, H. Feußner, and N. Navab, "The TUM LapChole dataset for the M2CAI 2016 workflow challenge," arXiv preprint arXiv:1610.09278, 2017.
- [26] C. I. Nwoye, T. Yu, C. Gonzalez, B. Seeliger, P. Mascagni, D. Mutter, J. Marescaux, and N. Padoy, "Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos," Medical Image Analysis, vol. 78, p. 102433, 2022.
- [27] Z. Wang, B. Lu, Y. Long, F. Zhong, T.-H. Cheung, Q. Dou, and Y. Liu, "AutoLaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 486–496.
- [28] D. Guo, W. Si, Z. Li, J. Pei, and P.-A. Heng, "Surgical workflow recognition and blocking effectiveness detection in laparoscopic liver resection with pringle maneuver," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3220–3228.
- [29] M. Hu, P. Xia, L. Wang, S. Yan, F. Tang, Z. Xu, Y. Luo, K. Song, J. Leitner, X. Cheng et al., "OphNet: A large-scale video benchmark for ophthalmic surgical workflow understanding," in European Conference on Computer Vision (ECCV). Springer, 2024.
- [30] E. D. Goodman, K. K. Patel, Y. Zhang, W. Locke, C. J. Kennedy, R. Mehrotra, S. Ren, M. Guan, O. Zohar, M. Downing et al., "Analyzing surgical technique in diverse open surgical videos with multitask machine learning," JAMA Surgery, 2024.
- [31] N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. B. Haro, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager, "A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery," IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2025–2041, 2017.
- [32] H. Hoffmann, I. Funke, P. Peters, D. K. Venkatesh, J. Egger, D. Rivoir, R. Röhrig, F. Hölzle, S. Bodenstedt, M.-C. Willemer, S. Speidel, and B. Puladi, "AIxSuture: vision-based assessment of open suturing skills," International Journal of Computer Assisted Radiology and Surgery, vol. 19, no. 6, pp. 1045–1052, 2024.
- [33] K. Schoeffmann, H. Husslein, S. Kletz, S. Petscharnig, B. Münzer, and C. Beecks, "Video retrieval in laparoscopic video recordings with dynamic content descriptors," Multimedia Tools and Applications, vol. 77, no. 13, pp. 16813–16832, 2018.
- [34] Y. Tian, G. Pang, F. Liu, Y. Liu, C. Wang, Y. Chen, J. Verjans, and G. Carneiro, "Contrastive transformer-based multiple instance learning for weakly supervised polyp frame detection," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 88–98.
- [35] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.
- [36] A. Rau, P. E. Edwards, O. F. Ahmad, P. Riordan, M. Janatka, L. B. Lovat, and D. Stoyanov, "Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy," International Journal of Computer Assisted Radiology and Surgery, vol. 14, pp. 1167–1176, 2019.
- [37] T. L. Bobrow, M. Golhar, R. Vijayan, V. S. Akshintala, J. R. Garcia, and N. J. Durr, "Colonoscopy 3D video dataset with paired depth from 2D-3D registration," Medical Image Analysis, p. 102956, 2023.
- [38] H. Al Hajj, M. Lamard, P.-H. Conze, B. Cochener, and G. Quellec, "CATARACTS: Challenge on automatic tool annotation for cataract surgery," Medical Image Analysis, vol. 52, pp. 24–41, 2019. Available: https://dx.doi.org/10.21227/ac97-8m18
- [39] J. L. Lavanchy, S. Ramesh, D. Dall'Alba, C. Gonzalez, P. Fiorini, B. P. Müller-Stich, P. C. Nett, J. Marescaux, D. Mutter, and N. Padoy, "Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery," International Journal of Computer Assisted Radiology and Surgery, vol. 19, pp. 2249–2258, 2024.
- [40] G. Wang, H. Xiao, R. Zhang, H. Gao, L. Bai, X. Yang, Z. Li, H. Li, and H. Ren, "CoPESD: A multi-level surgical motion dataset for training large vision-language models to co-pilot endoscopic submucosal dissection," in Proceedings of the 33rd ACM International Conference on Multimedia, 2024.
- [41] N. Valderrama, O. Zisimopoulos, and S. Giannarou, "Towards holistic surgical scene understanding," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 442–452.
- [42] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, "Kvasir-SEG: A segmented polyp dataset," in MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II, 2020, pp. 451–462.
- [43] D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, M. Drozdzal, and A. Courville, "A benchmark for endoluminal scene segmentation of colonoscopy images," Journal of Healthcare Engineering, vol. 2017, no. 1, p. 4037190, 2017.
- [44] J. Bernal, J. Sánchez, and F. Vilarino, "Towards automatic polyp detection with a polyp appearance model," Pattern Recognition, vol. 45, no. 9, pp. 3166–3182, 2012.
- [45] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, "Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer," International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 2, pp. 283–293, 2014.
- [46] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, "PraNet: Parallel reverse attention network for polyp segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 263–273.
- [47] T. Kim, H. Lee, and D. Kim, "UACANet: Uncertainty augmented context attention for polyp segmentation," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2167–2175.
- [48] B.-C. Hu, G.-P. Ji, D. Shao, and D.-P. Fan, "PraNet-V2: Dual-supervised reverse attention for medical image segmentation," arXiv preprint arXiv:2504.10986, 2025.