Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture
Pith reviewed 2026-05-21 13:53 UTC · model grok-4.3
The pith
Ultra-lightweight models under 0.3 million parameters enable real-time colonoscopic polyp segmentation directly on commodity CPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The UltraSeg architecture replaces heavy standard components with grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, yielding CPU-native models below 0.3M parameters. The paper reports real-time throughput above 50 FPS at 256x256 resolution on a single CPU core, Dice scores exceeding 0.8 on seven public datasets for UltraSeg-130K, and zero-shot external performance that approaches or exceeds a 7.76M-parameter UNet-Medium while using only 1.7 percent of its parameters.
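The 1.7 percent figure can be checked from the two parameter counts quoted in the abstract. The snippet below is only that arithmetic; the constant and function names are illustrative and do not come from the paper's code.

```python
# Parameter counts in millions, as quoted in the abstract.
ULTRASEG_130K_M = 0.130
UNET_MEDIUM_M = 7.76


def param_fraction(small_m: float, large_m: float) -> float:
    """Return the small model's parameter count as a fraction of the large one."""
    return small_m / large_m


fraction = param_fraction(ULTRASEG_130K_M, UNET_MEDIUM_M)
print(f"{fraction:.1%}")             # prints 1.7%
print(f"{1 / fraction:.0f}x fewer")  # prints 60x fewer
```

Put the other way round, the UNet-Medium baseline carries roughly sixty times as many parameters as UltraSeg-130K.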
What carries the argument
Grouped multi-rate dilated convolutions paired with attention-gated cross-layer fusion, which together capture multi-scale features and fuse them across layers while keeping the total parameter count extremely low.
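To see why this pairing can stay small, it helps to count weights. The sketch below uses the standard formulas for a 2D convolution: grouping divides the weight count by the number of groups, and dilation widens the spatial extent of the kernel without adding any weights. The channel width, group count, and dilation rates are hypothetical values chosen for illustration and are not taken from the paper.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weight count of a k x k 2D convolution without bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * k * k


def effective_kernel(k: int, dilation: int) -> int:
    """Spatial extent covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


# Hypothetical layer sizes, NOT the paper's configuration.
C, K, GROUPS = 64, 3, 4
RATES = (1, 2, 4, 8)

dense = conv_params(C, C, K)                        # 36864 weights
grouped = conv_params(C, C, K, GROUPS)              # 9216 weights
extents = [effective_kernel(K, d) for d in RATES]   # [3, 5, 9, 17]

print(dense, grouped, extents)
```

Under these assumed sizes, giving each of the four groups its own dilation rate would cover extents from 3 to 17 pixels at a quarter of the dense layer's weight count. Whether UltraSeg assigns rates per group in this way is not stated in the text above.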
If this is right
- Real-time inference exceeds 50 FPS at 256x256 and 30 FPS at 352x352 on a single CPU core.
- UltraSeg-130K substantially outperforms every published competitor that also stays under 0.3M parameters.
- Scaling the same design principles to 4.38M parameters yields accuracy competitive with heavyweight state-of-the-art models while retaining a large efficiency lead.
- The resulting CPU-native pipeline supplies an immediately usable tool for clinical sites without GPU hardware.
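The throughput figures translate into per-frame time budgets: 50 FPS leaves 20 ms per frame and 30 FPS about 33 ms. A minimal timing harness, assuming the model's forward pass is wrapped in a zero-argument callable, might look like the following. The `dummy_step` workload is a placeholder rather than a real network, and the sketch does not pin the process to a single CPU core, which a faithful reproduction of the paper's setting would need to do.

```python
import time
from typing import Callable


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a target frame rate."""
    return 1000.0 / fps


def measure_fps(step: Callable[[], None], warmup: int = 5, iters: int = 50) -> float:
    """Run `step` repeatedly and return the measured frames per second."""
    for _ in range(warmup):          # warm-up runs are excluded from timing
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def dummy_step() -> None:
    # Placeholder workload standing in for one forward pass (hypothetical).
    sum(i * i for i in range(10_000))


print(frame_budget_ms(50))            # prints 20.0
print(round(frame_budget_ms(30), 1))  # prints 33.3
fps = measure_fps(dummy_step)
print(f"measured: {fps:.1f} FPS")
```

Excluding warm-up iterations matters on CPUs, where the first few calls often pay for cache and allocator start-up costs that do not recur in steady-state video inference.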
Where Pith is reading between the lines
- The same design choices could be tested on other real-time endoscopic or ultrasound tasks where GPU access is limited.
- The extreme parameter reduction opens the possibility of running the model on embedded processors inside portable colonoscopes.
- Further zero-shot tests on data from different patient demographics would clarify how far the generalization extends.
Load-bearing premise
The specific pairing of grouped multi-rate dilated convolutions with attention-gated cross-layer fusion produces better internal representations than other lightweight designs, and these gains hold on data and conditions outside the seven datasets tested.
What would settle it
The claim would be undermined if, on a new external colonoscopy dataset collected under different imaging conditions or equipment, UltraSeg-130K scored below 0.75 Dice or lost its speed advantage over a comparably sized standard lightweight network.
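The accuracy half of this test rests on the Dice coefficient, 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a ground-truth mask T. The sketch below computes it on binary masks stored as nested lists and applies the 0.75 threshold. The two toy masks are made up for illustration and are not data from the paper.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]

DICE_THRESHOLD = 0.75  # threshold named in the falsification criterion


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient of two binary masks with the same shape."""
    inter = p_sum = t_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += p * t
            p_sum += p
            t_sum += t
    if p_sum + t_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2 * inter / (p_sum + t_sum)


# Toy masks for illustration only.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]

score = dice(pred, target)            # 2 * 2 / (3 + 3)
below_threshold = score < DICE_THRESHOLD
print(round(score, 3), below_threshold)
```

In practice the score would be averaged over every image in the external dataset before it is compared with the threshold.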
Original abstract
Real-time polyp segmentation is essential for early colorectal cancer detection, yet clinical deployment remains blocked by GPU dependency. We introduce the UltraSeg family, a set of CPU-native segmentation models operating below 0.3M parameters. UltraSeg-108K (0.108M) establishes the extreme-compression frontier, while UltraSeg-130K (0.130M) integrates cross-layer lightweight fusion for enhanced multi-center generalization. The architecture replaces parameter-heavy components with grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, achieving real-time throughput on a single CPU core (exceeding 50 FPS at 256x256 and 30 FPS at 352x352) without sacrificing clinical-grade accuracy. Evaluated on seven public datasets, UltraSeg-130K attains Dice scores exceeding 0.8 at both resolutions, substantially outperforming all existing sub-0.3M competitors. Notably, it approaches or exceeds UNet-Medium (7.76M parameters) on zero-shot external validations while using only 1.7% of its parameters, establishing the first strong baseline for CPU-native real-time polyp segmentation. When scaled to 4.38M parameters, UltraSeg achieves accuracy competitive with heavyweight state-of-the-art models while maintaining an order-of-magnitude parameter advantage, demonstrating that the proposed design principles yield intrinsic representational gains across the entire efficiency spectrum. By delivering the first clinically deployable, CPU-native real-time solution, this work provides an immediately usable tool for resource-limited settings and a reproducible blueprint for real-time medical AI beyond endoscopy. Source code is publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the UltraSeg family of ultra-lightweight CNN architectures (UltraSeg-108K at 0.108M parameters and UltraSeg-130K at 0.130M parameters) for real-time colonoscopic polyp segmentation on commodity CPUs. It replaces standard components with grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, claiming Dice scores exceeding 0.8 on seven public datasets at 256x256 and 352x352 resolutions, >50 FPS and >30 FPS respectively on a single CPU core, substantial outperformance of all sub-0.3M competitors, and performance approaching or exceeding UNet-Medium (7.76M parameters) on zero-shot external validations while using only 1.7% of its parameters. When scaled to 4.38M parameters the design remains competitive with heavyweight SOTA models; source code is released publicly.
Significance. If the performance and generalization claims hold after proper validation, the work would be significant for enabling GPU-free real-time polyp segmentation in clinical settings, particularly resource-limited environments. It provides the first strong CPU-native baseline for this task and demonstrates that the proposed design principles can scale across efficiency regimes. Public code release is a clear strength that supports reproducibility.
major comments (1)
- [Experiments] The central claim that grouped multi-rate dilated convolutions and attention-gated cross-layer fusion produce intrinsic representational gains responsible for Dice >0.8, outperformance of sub-0.3M baselines, and zero-shot generalization to external sets is load-bearing. The manuscript contains no controlled ablation tables or variants that swap only these modules while freezing training data, augmentation, optimizer, and all other hyperparameters (see Experiments section and any associated tables reporting Dice/FPS). Without such isolations the attribution of gains to architecture rather than data or protocol factors remains unsecured.
minor comments (2)
- [Abstract] The abstract and results text reference 'seven public datasets' without naming them; listing the exact dataset names and splits would improve clarity and allow readers to assess potential overlap with training data.
- [Results] No statistical significance tests, standard deviations, or confidence intervals are reported for the Dice scores or FPS measurements across the seven datasets and zero-shot validations; adding these would strengthen the quantitative claims.
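One common way to supply the missing uncertainty estimates is a percentile bootstrap over per-image Dice scores. The sketch below is a generic illustration of that technique, not the authors' procedure. The list of scores is a placeholder and does not come from the paper.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-image Dice scores, NOT results from the paper.
scores = [0.82, 0.79, 0.91, 0.75, 0.88, 0.84, 0.80, 0.86]

mean = statistics.fmean(scores)
std = statistics.stdev(scores)
lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.3f} std={std:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
```

Resampling at the image level assumes images are independent. For video-derived datasets, resampling whole sequences or patients would be the more conservative choice.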
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment below and commit to strengthening the experimental section accordingly.
Point-by-point responses
Referee: The central claim that grouped multi-rate dilated convolutions and attention-gated cross-layer fusion produce intrinsic representational gains responsible for Dice >0.8, outperformance of sub-0.3M baselines, and zero-shot generalization to external sets is load-bearing. The manuscript contains no controlled ablation tables or variants that swap only these modules while freezing training data, augmentation, optimizer, and all other hyperparameters (see Experiments section and any associated tables reporting Dice/FPS). Without such isolations the attribution of gains to architecture rather than data or protocol factors remains unsecured.
Authors: We agree that the current experiments do not fully isolate the contributions of the grouped multi-rate dilated convolutions and attention-gated cross-layer fusion through controlled module swaps under fixed training protocols. While the reported comparisons to sub-0.3M baselines and larger models were performed with consistent data splits, augmentation, and optimization settings, these do not constitute the strict isolations requested. In the revised manuscript we will add new ablation tables that systematically enable/disable or replace only these two modules while freezing all other factors (training data, augmentation, optimizer, learning rate schedule, and hyperparameters) and report the resulting Dice and FPS changes on the primary datasets. This will directly address the attribution concern.
Revision: yes
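The promised ablation amounts to a small factorial design: every on/off combination of the two modules, with everything else held fixed. A minimal sketch of how such a grid could be enumerated is shown below. The module keys and the frozen settings are hypothetical placeholders, not the paper's actual configuration.

```python
from itertools import product
from typing import Dict, List

# Settings frozen across all runs; values are placeholders, not the paper's.
FROZEN: Dict[str, object] = {
    "optimizer": "placeholder",
    "augmentation": "placeholder",
    "lr_schedule": "placeholder",
    "seed": 0,
}

# The two modules under test (hypothetical config keys).
MODULES = ("grouped_multirate_dilated_conv", "attention_gated_fusion")


def ablation_grid() -> List[Dict[str, object]]:
    """Return one config per on/off combination of the modules under test."""
    runs = []
    for flags in product((False, True), repeat=len(MODULES)):
        config = dict(FROZEN)              # copy so the runs stay independent
        config.update(zip(MODULES, flags))
        runs.append(config)
    return runs


grid = ablation_grid()
for run in grid:
    print({name: run[name] for name in MODULES})
```

With two modules this gives four runs. Comparing the run with both modules off against each single-module run separates the individual contributions, and the run with both on shows whether they interact.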
Circularity Check
No circularity: empirical architecture evaluated on external benchmarks
Full rationale
The paper introduces an ultra-lightweight segmentation architecture (UltraSeg) using grouped multi-rate dilated convolutions and attention-gated fusion, then reports Dice scores and FPS directly from training and testing on seven public datasets plus zero-shot external validations. No equations, fitted parameters, or self-citations are shown to reduce the reported accuracy or speed metrics to quantities defined by the authors' own inputs. Performance claims rest on standard benchmark comparisons against prior models rather than any self-referential derivation or renaming of known results. The central design choices are presented as engineering substitutions whose gains are measured empirically, not derived by construction from the evaluation protocol itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Convolutional layers with grouped multi-rate dilation and attention-gated fusion can extract sufficient features for polyp segmentation at sub-0.3M parameter budgets.