EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation
Pith reviewed 2026-05-21 15:20 UTC · model grok-4.3
The pith
EndoCaver jointly deblurs and segments endoscopic images to maintain high polyp detection accuracy despite fog, blur, and glare.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EndoCaver employs a unidirectional-guided dual-decoder transformer architecture that integrates a Global Attention Module for cross-scale feature aggregation, a Deblurring-Segmentation Aligner to pass restoration cues to the segmentation branch, and a cosine-based scheduler named LoCoS for balanced multi-task optimization. Experiments on the Kvasir-SEG dataset report Dice scores of 0.922 on clean images and 0.889 under simulated severe degradations, outperforming prior methods while reducing model parameters by 90 percent.
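The Dice scores quoted above follow the standard overlap definition, Dice = 2|A ∩ B| / (|A| + |B|), for a predicted mask A and a ground-truth mask B. The sketch below is a minimal pure-Python illustration of that definition and is not taken from the EndoCaver repository; the function name `dice_score`, the `eps` smoothing term, and the toy masks are assumptions made for this example.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    # |A ∩ B|: pixels that are foreground in both masks
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    # |A| + |B|: total foreground pixels across both masks
    total = sum(flat_pred) + sum(flat_target)
    # eps keeps the ratio defined when both masks are empty (hypothetical choice)
    return (2.0 * intersection + eps) / (total + eps)


# Toy 3x3 masks (illustrative only, not data from the paper)
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
score = dice_score(pred, target)
print(f"Dice = {score:.3f}")
```

A dataset-level figure such as 0.922 would typically be an average of per-image scores like this one, though the page does not state how the paper aggregates them.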
What carries the argument
Unidirectional-guided dual-decoder transformer with Global Attention Module (GAM) for multi-scale aggregation, Deblurring-Segmentation Aligner (DSA) for cue transfer, and LoCoS cosine scheduler for stable joint training.
If this is right
- Segmentation remains accurate without requiring a separate deblurring pre-processing stage.
- Model size drops by 90 percent, supporting direct on-device inference during procedures.
- Joint training lets restoration cues improve segmentation boundaries on degraded frames.
- The cosine scheduler stabilizes optimization when balancing deblurring and segmentation losses.
- Performance holds across clean and severely degraded versions of the same dataset.
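As a rough check on how well performance holds, the gap between the two reported Dice scores can be worked out directly from the quoted numbers. This is plain arithmetic on the rounded figures above, not an additional result from the paper.

```latex
\Delta_{\mathrm{Dice}} = 0.922 - 0.889 = 0.033,
\qquad
\frac{\Delta_{\mathrm{Dice}}}{0.922} = \frac{0.033}{0.922} \approx 0.036
```

That is, severe simulated degradation costs about 0.033 Dice in absolute terms, or roughly 3.6% relative to the clean score.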
Where Pith is reading between the lines
- The same dual-decoder pattern could extend to other paired medical tasks such as denoising plus vessel segmentation in retinal images.
- On-device deployment would lower data transmission needs and reduce patient privacy exposure.
- If the architecture generalizes, future endoscopic systems might omit dedicated restoration hardware entirely.
- Testing on additional datasets with real rather than simulated degradations would strengthen evidence for clinical use.
Load-bearing premise
The Kvasir-SEG dataset plus the added simulated degradations sufficiently match the distribution of fog, motion blur, and specular highlights found in real clinical endoscopic procedures.
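To make the premise concrete, the sketch below shows one way fog, motion blur, and glare could be simulated as three independent modules on a small grayscale image. It is an illustration under stated assumptions, not the paper's protocol: the haze model I = J·t + A·(1 − t), the horizontal box kernel, the Gaussian highlight, and every parameter range here are hypothetical choices, since the page only says the parameters are randomly sampled.

```python
import math
import random


def add_fog(img, transmission=0.6, airlight=0.9):
    """Uniform haze: I = J * t + A * (1 - t). Parameter values are assumptions."""
    return [[p * transmission + airlight * (1.0 - transmission) for p in row]
            for row in img]


def motion_blur(img, length=3):
    """Horizontal box blur over an odd `length`, with clamped image borders."""
    w = len(img[0])
    half = length // 2
    out = []
    for src_row in img:
        row = []
        for x in range(w):
            window = [src_row[min(max(x + k, 0), w - 1)]
                      for k in range(-half, half + 1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def add_glare(img, cy, cx, sigma=1.5, intensity=0.8):
    """Additive Gaussian highlight centred at (cy, cx), clipped to 1.0."""
    return [[min(1.0, p + intensity * math.exp(
                -((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2)))
             for x, p in enumerate(row)]
            for y, row in enumerate(img)]


def degrade(img, rng):
    """Apply blur, glare, then fog with independently sampled parameters."""
    h, w = len(img), len(img[0])
    out = motion_blur(img, length=rng.choice([3, 5]))
    out = add_glare(out, rng.randrange(h), rng.randrange(w),
                    intensity=rng.uniform(0.3, 0.9))
    return add_fog(out, transmission=rng.uniform(0.4, 0.8))


rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [[(x + y) / 14.0 for x in range(8)] for y in range(8)]  # toy gradient image
degraded = degrade(clean, rng)
print(len(degraded), len(degraded[0]))
```

Because each module draws its parameters independently, a pipeline of this shape cannot reproduce correlations between fog, motion, and highlights, which is exactly why the match to clinical footage is a premise rather than a given.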
What would settle it
Segmentation Dice scores measured on a fresh collection of un-simulated clinical endoscopic videos that contain natural lens fog, motion blur, and glare, with no retraining allowed.
Original abstract
Endoscopic image analysis is vital for colorectal cancer screening, yet real-world conditions often suffer from lens fogging, motion blur, and specular highlights, which severely compromise automated polyp detection. We propose EndoCaver, a lightweight transformer with a unidirectional-guided dual-decoder architecture, enabling joint multi-task capability for image deblurring and segmentation while significantly reducing computational complexity and model parameters. Specifically, it integrates a Global Attention Module (GAM) for cross-scale aggregation, a Deblurring-Segmentation Aligner (DSA) to transfer restoration cues, and a cosine-based scheduler (LoCoS) for stable multi-task optimisation. Experiments on the Kvasir-SEG dataset show that EndoCaver achieves 0.922 Dice on clean data and 0.889 under severe image degradation, surpassing state-of-the-art methods while reducing model parameters by 90%. These results demonstrate its efficiency and robustness, making it well-suited for on-device clinical deployment. Code is available at https://github.com/ReaganWu/EndoCaver.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents EndoCaver, a lightweight transformer with a unidirectional-guided dual-decoder architecture for joint deblurring and segmentation of endoscopic images degraded by fog, blur, and glare. It incorporates a Global Attention Module (GAM) for cross-scale aggregation, a Deblurring-Segmentation Aligner (DSA) to transfer restoration cues, and a cosine-based LoCoS scheduler for multi-task optimization. On the Kvasir-SEG dataset, the model achieves Dice scores of 0.922 on clean images and 0.889 under severe simulated degradations while reducing parameters by 90% relative to state-of-the-art methods, positioning it for on-device clinical deployment in colorectal cancer screening.
Significance. If the performance generalizes, the work provides a practical, parameter-efficient solution for improving automated polyp detection under real-world endoscopic conditions. The joint multi-task design and specific modules (GAM, DSA, LoCoS) constitute a targeted contribution to medical image restoration and analysis. The reported 90% parameter reduction and concrete Dice numbers on a public dataset are strengths that could support deployment if robustness claims are substantiated beyond simulation.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The central claim of 0.889 Dice under severe degradation and suitability for clinical deployment rests on Kvasir-SEG images with artificially added fog, blur, and glare. Real endoscopic degradations arise from correlated physical processes (variable moisture films, non-uniform motion, specular reflections tied to tissue geometry and lighting) whose joint statistics are unlikely to be reproduced by independent simulation modules. No held-out real-degraded test set or cross-validation against procedure videos is described, leaving the gap between simulated and actual conditions unquantified and directly affecting whether the Dice scores and 90% parameter reduction hold outside the training distribution.
- [Experimental Results] Experimental Results: The abstract reports concrete Dice numbers (0.922 clean / 0.889 degraded) and a 90% parameter reduction claim, but provides no information on baseline implementations, statistical significance, or the exact degradation simulation protocol (e.g., parameters for fog density, blur kernel, glare intensity). This leaves the performance superiority over state-of-the-art methods resting on unverified experimental details.
minor comments (2)
- [Method] Method section: The LoCoS scheduler is described as cosine-based for stable multi-task optimisation; providing the explicit formulation or pseudocode would clarify how it balances the deblurring and segmentation losses beyond standard cosine annealing.
- [Figures and Tables] Figure captions and tables: Ensure all reported metrics include standard deviations or confidence intervals from multiple runs to support the numerical claims.
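The reporting requested in the last comment can be sketched in a few lines. The example below summarises Dice scores from repeated training runs as mean ± sample standard deviation; the function name `summarize_runs` is hypothetical and the five scores are placeholder values invented for illustration, not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-run metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std, divides by n - 1
    return mean, std


# Placeholder per-seed Dice scores (NOT from the paper)
dice_runs = [0.90, 0.91, 0.89, 0.90, 0.90]
mean, std = summarize_runs(dice_runs)
print(f"Dice: {mean:.3f} ± {std:.3f} over {len(dice_runs)} runs")
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so the number of runs would need to be reported alongside it.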
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below, providing clarifications and indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and Experiments] Abstract and Experiments section: The central claim of 0.889 Dice under severe degradation and suitability for clinical deployment rests on Kvasir-SEG images with artificially added fog, blur, and glare. Real endoscopic degradations arise from correlated physical processes (variable moisture films, non-uniform motion, specular reflections tied to tissue geometry and lighting) whose joint statistics are unlikely to be reproduced by independent simulation modules. No held-out real-degraded test set or cross-validation against procedure videos is described, leaving the gap between simulated and actual conditions unquantified and directly affecting whether the Dice scores and 90% parameter reduction hold outside the training distribution.
  Authors: We acknowledge that simulated degradations, while following established models for fog, blur, and glare, may not fully capture the correlated nature of real endoscopic artifacts. Our approach uses a combination of these degradations to simulate challenging conditions commonly encountered in clinical settings. To address this, we will revise the manuscript to include a more explicit discussion of the simulation protocol's limitations and its relation to real-world conditions. Additionally, we will add references to prior works that have used similar simulation strategies for endoscopic image enhancement. We believe this provides a transparent view of the current evaluation while highlighting the method's potential.
  revision: partial
- Referee: [Experimental Results] Experimental Results: The abstract reports concrete Dice numbers (0.922 clean / 0.889 degraded) and a 90% parameter reduction claim, but provides no information on baseline implementations, statistical significance, or the exact degradation simulation protocol (e.g., parameters for fog density, blur kernel, glare intensity). This leaves the performance superiority over state-of-the-art methods resting on unverified experimental details.
  Authors: We agree that additional details are necessary to ensure reproducibility and to substantiate the claims. In the revised manuscript, we will expand the Experimental Results section to include: (1) the precise parameters and implementation details of the degradation simulation (fog density, blur kernel sizes and types, glare intensity and placement), (2) descriptions of how baseline methods were implemented or adapted, and (3) statistical analysis including standard deviations and significance tests for the reported metrics. These additions will allow readers to better evaluate the results.
  revision: yes
- Acquiring a dedicated held-out set of real degraded endoscopic images with expert-annotated segmentation masks for polyps would require significant additional resources and ethical approvals for data collection, which is beyond the immediate scope of this work but is noted as an important direction for future validation.
Circularity Check
No circularity: empirical performance metrics on public dataset with simulated degradations
Full rationale
The paper reports direct empirical measurements (Dice scores of 0.922 clean / 0.889 severe on Kvasir-SEG with added fog/blur/glare) from a proposed lightweight transformer architecture. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the reported outputs. The central claims rest on standard train/test splits and simulated degradations rather than self-definitional loops, fitted-input predictions, or load-bearing self-citations. The architecture components (GAM, DSA, LoCoS) are presented as design choices evaluated experimentally, not as tautological redefinitions of the metrics.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-task loss balancing weights
axioms (1)
- domain assumption: A transformer backbone with cross-scale attention can simultaneously restore and segment degraded medical images without task-specific conflicts.
invented entities (3)
- Global Attention Module (GAM): no independent evidence
- Deblurring-Segmentation Aligner (DSA): no independent evidence
- LoCoS scheduler: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · dAlembert_cosh_solution_aczel · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  w_seg(t) = w_min + 1/2 (1 - w_min)(1 + cos(π t / T)) ... cosine annealing-based loss scheduler (LoCoS)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
  EndoCaver ... 7.81M-parameter dual-decoder model ... 0.9221 Dice on clean ... 0.8893 under degraded
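The cosine weight quoted in the first entry above, w_seg(t) = w_min + 1/2 (1 - w_min)(1 + cos(π t / T)), can be evaluated directly. The sketch below is a minimal reading of that formula only; the function name `locos_seg_weight` and the default `w_min=0.1` are assumptions, and the quoted passage does not say how this weight is combined with the deblurring loss, so no combined loss is shown.

```python
import math


def locos_seg_weight(t, total_steps, w_min=0.1):
    """w_seg(t) = w_min + 0.5 * (1 - w_min) * (1 + cos(pi * t / T)).

    `w_min=0.1` is a hypothetical default; the page does not give its value.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return w_min + 0.5 * (1.0 - w_min) * (1.0 + math.cos(math.pi * t / total_steps))


T = 100  # illustrative schedule length, not the paper's epoch count
for step in (0, T // 4, T // 2, 3 * T // 4, T):
    print(step, round(locos_seg_weight(step, T), 4))
```

Read literally, the formula starts at w_seg(0) = 1 and decays smoothly to w_seg(T) = w_min, passing through the midpoint (1 + w_min)/2 at t = T/2.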
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation (arXiv, 2026)
  INTRODUCTION Colorectal cancer is the third most common cancer worldwide, accounting for nearly 10% of all cancer cases, and the second leading cause of cancer-related deaths globally [1]. Early detection of colorectal polyps through endoscopy is an effective preventive strategy. However, real-world endoscopic imaging often suffers from severe quality d...
- [2] METHODOLOGY 2.1. Overall Architecture EndoCaver is a lightweight dual-decoder transformer designed for joint endoscopic image deblurring and polyp segmentation under real-world degradations. As shown in Fig. 2(a), the framework consists of: (i) an MiT-B0 encoder [5] for efficient hierarchical representation, (ii) a Global Attention Module (GAM) for cross-scale...
- [3] EXPERIMENTS AND RESULTS 3.1. Experimental Setup Our model is implemented in PyTorch and trained on a single NVIDIA A100 80G GPU. Input images are resized to 224 × 224 with a batch size of 16. Training is with the Adam optimizer, warmup, and cosine annealing learning rate schedule from 1×10^−4 during epochs (Deblurring, Endocaver: 3000, Segmentation: 200). As ...
- [4] and CVC-ColonDB [18]. Segmentation is assessed by Dice, IoU, and Recall, while deblurring quality is measured by PSNR and SSIM (higher is better). Synthetic Degradations. To evaluate robustness under adverse imaging conditions, we generate degraded images with motion/defocus blur, specular highlights, and lens fogging using randomly sampled parameters....
- [5] CONCLUSION In this paper, we propose EndoCaver, a lightweight dual-decoder transformer that jointly performs deblurring and segmentation for endoscopic images. The Global Attention Module enhances encoder features, the Deblurring-Segmentation Aligner transfers restoration cues to segmentation, and the cosine annealing loss scheduler adaptively balanc...
- [6] Eileen Morgan, Melina Arnold, A Gini, V Lorenzoni, CJ Cabasag, Mathieu Laversanne, Jerome Vignat, Jacques Ferlay, Neil Murphy, and Freddie Bray, “Global burden of colorectal cancer in 2020 and 2040: incidence and mortality estimates from globocan,” Gut, vol. 72, no. 2, pp. 338–344, 2023
- [7] Bernd Münzer, Klaus Schoeffmann, and Laszlo Böszörmenyi, “Relevance segmentation of laparoscopic videos,” in 2013 IEEE international symposium on multimedia. IEEE, 2013, pp. 84–91
- [8] Scarlet Nazarian, Ben Glover, Hutan Ashrafian, Ara Darzi, and Julian Teare, “Diagnostic accuracy of artificial intelligence and computer-aided diagnosis for the detection and characterization of colorectal polyps: systematic review and meta-analysis,” Journal of medical Internet research, vol. 23, no. 7, pp. e27370, 2021
- [9] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241
- [10] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in neural information processing systems, vol. 34, pp. 12077–12090, 2021
- [11] Zhuoyu Wu, Qinchen Wu, Wenqi Fang, Wenhui Ou, Quanjun Wang, Linde Zhang, Chao Chen, Zheng Wang, and Heshan Li, “Harmonizing unets: Attention fusion module in cascaded-unets for low-quality oct image fluid segmentation,” Computers in Biology and Medicine, vol. 183, pp. 109223, 2024
- [12] Jiaxuan Li, Qing Xu, Xiangjian He, Ziyu Liu, Daokun Zhang, Ruili Wang, Rong Qu, and Guoping Qiu, “Cfformer: Cross cnn-transformer channel attention and spatial feature fusion for improved segmentation of low-quality medical images,” Available at SSRN 5243043, 2025
- [13] Limiao Li, Keke He, Xiaoyu Zhu, Fangfang Gou, and Jia Wu, “A pathology image segmentation framework based on deblurring and region proxy in medical decision-making system,” Biomedical Signal Processing and Control, vol. 95, pp. 106439, 2024
- [14] Duwei Dai, Caixia Dong, Qingsen Yan, Yongheng Sun, Chunyan Zhang, Zongfang Li, and Songhua Xu, “I2u-net: A dual-path u-net with rich information interaction for medical image segmentation,” Medical Image Analysis, vol. 97, pp. 103241, 2024
- [15] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520
- [16] Sachin Mehta and Mohammad Rastegari, “Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer,” arXiv preprint arXiv:2110.02178, 2021
- [17] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763
- [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017
- [19] Bin Xiao, Jinwu Hu, Weisheng Li, Chi-Man Pun, and Xiuli Bi, “Ctnet: Contrastive transformer network for polyp segmentation,” IEEE Transactions on Cybernetics, vol. 54, no. 9, pp. 5040–5053, 2024
- [20] Zhenni Yu, Li Zhao, Tangfei Liao, Xiaoqin Zhang, Geng Chen, and Guobao Xiao, “A novel non-pretrained deep supervision network for polyp segmentation,” Pattern Recognition, vol. 154, pp. 110554, 2024
- [21] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas De Lange, Dag Johansen, and Håvard D Johansen, “Kvasir-seg: A segmented polyp dataset,” in International conference on multimedia modeling. Springer, 2019, pp. 451–462
- [22] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized medical imaging and graphics, vol. 43, pp. 99–111, 2015
- [23] Jorge Bernal, Javier Sánchez, and Fernando Vilarino, “Towards automatic polyp detection with a polyp appearance model,” Pattern Recognition, vol. 45, no. 9, pp. 3166–3182, 2012
- [24] Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko, “Rethinking coarse-to-fine approach in single image deblurring,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4641–4650
- [25] Zhuoyu Wu, Wenhui Ou, Qiawei Zheng, Jiayan Yang, Quanjun Wang, Wenqi Fang, Zheng Wang, Yongkui Yang, and Heshan Li, “Rt-focuser: A real-time lightweight model for edge-side image deblurring,” in 2025 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA). IEEE, 2025, pp. 255–256