arxiv: 2604.27383 · v1 · submitted 2026-04-30 · 📡 eess.IV · cs.CV

Recognition: unknown

A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation

Yang Zhou , Chaoyong Zhang , Ruoyi Hao , Huilin Pan , Yang Zhang , Hongliang Ren

Authors on Pith no claims yet

Pith reviewed 2026-05-07 10:24 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords glottis segmentationnasotracheal intubationreal-time segmentationscale-robust networklightweight neural networkmedical image segmentation

0 comments

The pith

A lightweight network segments the glottis reliably during real-time nasotracheal intubation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a neural network tailored for glottis segmentation in videos used to guide nasotracheal intubation. The glottis starts small and can expand to fill the entire view, while lighting and anatomy make detection hard, and previous tools are too slow or inaccurate for portable use. The solution stacks a multi-receptive field module that extracts features across scales to handle size variation and uses a revised way to assign labels in training. Tests show the resulting compact network reaches 92.9 percent mDice accuracy, runs faster than 170 frames per second, and uses only 19 megabytes of memory, outperforming earlier methods on three datasets.

Core claim

The authors claim that their lightweight multi-receptive field feature extraction module, when stacked to build the network backbone and neck, combined with an advanced label assignment strategy that redefines the number of samples, successfully reduces intra-class differences and yields scale-robust, high-accuracy glottis segmentation suitable for real-time vision-assisted NTI.

What carries the argument

A stacked lightweight multi-receptive field feature extraction module that processes features at multiple scales to reduce intra-class differences, together with a redefined label assignment method.

If this is right

The approach supports real-time glottis segmentation on portable medical devices during intubation procedures.
It maintains accuracy across large variations in glottis scale and difficult imaging conditions.
The model size of 19 MB and speed above 170 fps make it practical for clinical deployment.
It outperforms state-of-the-art segmentation algorithms on three separate datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The multi-receptive field idea could extend to other endoscopic tasks where structures change size dramatically.
The label assignment technique may benefit other lightweight detection networks in medical imaging.
Clinical trials in operating rooms could verify if the speed gains actually shorten intubation times or improve success rates.

Load-bearing premise

The multi-receptive field module and redefined label assignment will keep reducing intra-class differences when applied to new patient anatomies, lighting conditions, and motion artifacts not present in the three test datasets.

What would settle it

A significant drop in segmentation accuracy on a fourth dataset collected under different clinical conditions with unseen variations would show the method lacks the claimed robustness.

read the original abstract

Nasotracheal intubation (NTI) is a critical clinical procedure for establishing and maintaining patient airway patency. Machine-assisted NTI has emerged as a pivotal approach for optimizing procedural efficiency and minimizing manual intervention. However, visual detection algorithms employed for NTI navigation encounter significant challenges, including complex anatomical environments and suboptimal illumination conditions surrounding the glottis. Additionally, the glottis presents considerable scale variability throughout the procedure, initially appearing as a small, difficult-to-capture structure before expanding to occupy nearly the entire field of view. Moreover, traditional visual detection methods often have high computational costs, making real-time, high-precision detection on portable devices challenging. To enhance NTI efficacy and address these challenges, this paper proposes a novel glottis segmentation framework optimized for vision-assisted NTI applications. First, we designed a lightweight, multi-receptive field feature extraction module to reduce intra-class differences, achieving robustness to scale variations of the glottis. This module was then stacked to form the backbone and neck of our network. Subsequently, we developed an advanced label assignment method and redefined the number of samples to further reduce intra-class differences and enhance accuracy in the complex NTI environment. Experiments on three distinct datasets demonstrate that our network surpasses state-of-the-art algorithms, achieving a segmentation mDice of 92.9\% with a compact model size of 19 MB and an inference speed exceeding 170 frames per second. % Our code and datasets will be open-sourced on GitHub after the manuscript is accepted. Our code and datasets are available at https://github.com/HBUT-CV/GlottisNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A practical lightweight glottis segmenter for nasotracheal intubation that reports strong speed and accuracy on three datasets but leaves generalization untested.

read the letter

This paper puts forward a compact network for glottis segmentation during nasotracheal intubation. It stacks a multi-receptive field module into the backbone and neck to cope with the glottis changing size from small to filling the view, then adds a custom label assignment step to lower intra-class differences in poor lighting and complex anatomy. The result is a 19 MB model that runs above 170 fps and reaches 92.9% mDice on three datasets, beating the baselines shown.

Referee Report

2 major / 2 minor

Summary. The paper proposes a lightweight neural network for real-time glottis segmentation in nasotracheal intubation (NTI) videos. It introduces a multi-receptive field feature extraction module, stacked to form the backbone and neck, to achieve scale robustness by reducing intra-class differences, combined with an advanced label assignment strategy that redefines the number of samples. The model is evaluated on three distinct datasets and is claimed to surpass state-of-the-art methods, achieving 92.9% mean Dice coefficient with a 19 MB model size and inference speed exceeding 170 frames per second.

Significance. If the performance claims hold after rigorous validation, the work could have meaningful clinical impact by enabling efficient, real-time vision assistance for NTI procedures on portable devices under challenging anatomical, illumination, and scale conditions. The emphasis on compactness and speed addresses a practical barrier in medical imaging deployment, and the focus on scale variability is well-motivated for this application.

major comments (2)

[Experiments] Experiments section: The manuscript reports a segmentation mDice of 92.9% and superiority over SOTA on three datasets but provides no details on dataset characteristics (e.g., image counts, patient diversity, annotation protocols), baseline implementations, ablation studies, or statistical testing. This omission is load-bearing for the central claim, as it prevents confirmation that gains arise from the multi-receptive field module and redefined label assignment rather than dataset-specific factors or post-hoc tuning.
[Method] Method section: The multi-receptive field module and label assignment are presented as key to reducing intra-class differences for scale variability, yet the evaluation lacks cross-dataset or external validation on unseen clinical variations (anatomies, lighting, motion artifacts). Since the weakest assumption and central claim of robustness in complex NTI scenes depend on this generalization, additional testing is required to substantiate the modules' effectiveness beyond the three internal datasets.

minor comments (2)

[Abstract] Abstract: A stray LaTeX comment ('% Our code...') appears in the text and should be removed for a clean final version.
The GitHub link for code and datasets is provided; ensure it is functional and that the promised materials are released upon acceptance to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for strengthening the experimental rigor and validation of our claims regarding scale robustness in glottis segmentation. We have prepared point-by-point responses and will incorporate revisions to address these concerns.

read point-by-point responses

Referee: [Experiments] Experiments section: The manuscript reports a segmentation mDice of 92.9% and superiority over SOTA on three datasets but provides no details on dataset characteristics (e.g., image counts, patient diversity, annotation protocols), baseline implementations, ablation studies, or statistical testing. This omission is load-bearing for the central claim, as it prevents confirmation that gains arise from the multi-receptive field module and redefined label assignment rather than dataset-specific factors or post-hoc tuning.

Authors: We agree that additional experimental details are essential to substantiate our claims. In the revised manuscript, we will expand the Experiments section with: (1) comprehensive dataset descriptions including total image counts, patient demographics and diversity, acquisition conditions, and annotation protocols (performed by clinical experts with inter-annotator agreement metrics); (2) explicit details on baseline implementations, including sources (official repositories or re-implementations with hyperparameters), training protocols, and hardware used for fair comparison; (3) full ablation studies isolating the contributions of the multi-receptive field module and advanced label assignment strategy, with quantitative results on mDice, model size, and FPS; and (4) statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) across multiple runs to confirm improvements are not due to dataset-specific factors or tuning. These additions will directly link performance gains to our proposed components. revision: yes
Referee: [Method] Method section: The multi-receptive field module and label assignment are presented as key to reducing intra-class differences for scale variability, yet the evaluation lacks cross-dataset or external validation on unseen clinical variations (anatomies, lighting, motion artifacts). Since the weakest assumption and central claim of robustness in complex NTI scenes depend on this generalization, additional testing is required to substantiate the modules' effectiveness beyond the three internal datasets.

Authors: The three datasets in our study were specifically chosen to represent distinct clinical variations in NTI procedures, including differences in anatomical scales, illumination, patient anatomy, and procedural stages. To further validate generalization, we will add cross-dataset experiments in the revised paper: training on combinations of two datasets and evaluating on the held-out third, reporting mDice and other metrics to demonstrate robustness to unseen variations. We will also include qualitative analysis of failure cases related to motion artifacts and lighting, along with a limitations discussion. While fully external multi-center validation would require additional data collection beyond the current scope, the proposed cross-validation and expanded analysis will provide stronger evidence for the modules' effectiveness in complex scenes. revision: partial

Circularity Check

0 steps flagged

No circularity: performance metrics are measured on held-out data, not derived by construction

full rationale

The paper introduces a multi-receptive-field module and a redefined label-assignment strategy whose purpose is stated as reducing intra-class differences for scale robustness. These design choices are then evaluated by training and testing on three distinct datasets, with the central claim being an observed mDice of 92.9% on separate test splits. No equations, loss terms, or parameter-fitting steps are shown that would make the reported segmentation accuracy algebraically equivalent to the module definitions or to any fitted quantity used in the claim. The derivation chain therefore terminates in external empirical measurements rather than in self-referential definitions or self-citations.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical validation rather than first-principles derivation. The network weights are learned from data, and success depends on the assumption that the three datasets adequately represent clinical variability.

free parameters (1)

network hyperparameters and weights
All convolutional filters and training hyperparameters are fitted to the training portions of the three datasets.

axioms (1)

domain assumption Convolutional neural networks can extract scale-robust features when trained with appropriate receptive fields and label assignment.
Invoked when claiming the multi-receptive field module reduces intra-class differences.

pith-pipeline@v0.9.0 · 5612 in / 1246 out tokens · 78564 ms · 2026-05-07T10:24:20.789151+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 5 canonical work pages

[1]

Develope a lightweight real-time segmentation frame- work that enables rapid detection of multi-scale glottal structures in complex environments, providing precise visual navigation for robot-assisted NTI
[2]

Propose a novel multi-receptive field feature extraction module (LightSRM) that effectively mitigates intra-class variations and maintains robust performance across di- verse glottic scale variations
[3]

Through meticulously designing the sample quantity and implementing a novel label assignment method, we minimize environmental impact and further reduce intra- class differences, thereby improving accuracy
[4]

The remainder of this paper is organized as follows: Sec- tion II introduces the related work on detection and segmenta- tion

Results on three comprehensive datasets show that our method achieves state-of-the-art detection accuracy, while maintaining a compact model size (19 MB) and superior inference speed ( >170 FPS). The remainder of this paper is organized as follows: Sec- tion II introduces the related work on detection and segmenta- tion. Section III details our proposed m...
[5]

These datasets provide a crucial foundation for subsequent investigations

and [16] established comprehensive glottal segmentation datasets utilizing real clinical data. These datasets provide a crucial foundation for subsequent investigations. Despite these AUTHOR et al.: TITLE 3 Fig. 1: The overview of the proposed framework. Our framework includes several parts: the convolution module (ConvModule), the lightweight scale robus...
[6]

For all ablation experiments, we set the batch size to 128 and conduct training for 500 epochs

Implementation Details: We utilize the AdamW optimizer to train GlottisNet with the weight decay parameter set to 0.05. For all ablation experiments, we set the batch size to 128 and conduct training for 500 epochs. We implement a cosine annealing learning rate scheduler, initializing at 0.0005 and gradually attenuating to zero throughout the training pro...
[7]

Datasets: To systematically assess the efficacy and ro- bustness of our method, we conducted comprehensive evalu- ations using three distinct datasets: PID: We developed the Phantom Image Dataset (PID) using phantoms to simulate diverse NTI scenarios as shown in Fig. 5a. The dataset comprises 2,746 images with a resolution of 400×400 pixels, partitioned i...
[8]

Evaluation Metrics: To comprehensively assess object detection performance, we employ standard COCO evaluation metrics [44], specifically the mean Average Precision (mAP) and mAP at an IoU threshold of 0.5 (AP50). Furthermore, to quantitatively evaluate model robustness against scale varia- tions, we incorporate scale-specific average precision metrics: A...
[9]

We employ RTMDet-tiny as our baseline model and evaluate its performance on the PID dataset (Table I)

Overall Network Structure: To systematically evaluate the proposed architecture, we conduct comprehensive ablation studies. We employ RTMDet-tiny as our baseline model and evaluate its performance on the PID dataset (Table I). The first row demonstrates that RTMDet-tiny achieves an mAP of 34.5% with a model size of 84 MB. However, the baseline model suffe...
[10]

LightSRM: To rigorously evaluate the efficacy and novel contributions of the LightSRM architecture, we conduct com- prehensive ablation studies. As illustrated in Table II, the experimental design compares three distinct configurations to isolate the impact of channel attention integration, enabling systematic comparative analysis of model performance. We...
[11]

41.1 62.6 40.1 62.5 56.1 71.9 [1, 2] 44.9 69.4 44.1 68.9 59.5 74.6 [1, 2, 5] 46.0 75.8 44.2 75.0 68.2 81.1 [1, 2, 5, 1, 2] 43.7 68.2 42.9 64.6 53.7 69.9 TABLE IV: Ablation study for the cost matrix. Cost weights Detection(%) Segmentation(%) λ1 λ2 λ3 mAP AP50 mAP AP50 mIoU mDice 3 1 1 54.0 79.6 52.2 81.0 80.0 88.9 1 1 3 51.9 78.1 51.9 78.9 77.1 87.1 3 1 3 ...
[12]

Dilation rate: Considering that dilated convolutions in- herently risk of information loss between sampled points, we implement a cascade of multiple dilated convolution layers with systematically varying dilation rates to maintain an exten- sive receptive field while substantially reducing computational complexity. Following the dilation rate configurati...
[13]

Using our best-performing model configuration (Table I, last row), we conduct comprehensive ablation studies on the PID dataset (Table IV)

Cost Matrix: We evaluate the effectiveness of the cost matrix by analyzing its impact on model performance met- rics. Using our best-performing model configuration (Table I, last row), we conduct comprehensive ablation studies on the PID dataset (Table IV). The initial experiment optimizes the classification cost while maintaining the cost matrix. This op...
[14]

Number of Samples: Our label assignment method dy- namically matches positive and negative samples within a predefined number of total samples. Therefore, we conduct an ablation study on the required number of positive samples (TopK) by utilizing a cost matrix optimized with weights [3, 1, 3] and initialized with 13 positive samples. As shown in Table V, ...

2021
[15]

When evaluated using the AP50 metric, all SOTA methods exhibit satisfactory performance on the structurally simpler BAGLS dataset

Accuracy: As illustrated in Table VI, GlottisNet demon- strates superior detection performance with photometric dis- tortion preprocessing, achieving mAP scores of 57.8%, 63.1%, and 37.2% on these datasets, respectively, thereby surpassing all current SOTA methods. When evaluated using the AP50 metric, all SOTA methods exhibit satisfactory performance on ...
[16]

Model size and FPS: The quantitative analysis presented in Table VII indicates that GlottisNet achieves significant pa- rameter efficiency with a compact size of 19 MB, representing an 8-fold reduction compared to the baseline architecture. With a fixed input resolution of 400 × 400 pixels, the GlottisNet model demonstrates excellent inference performance...
[17]

5 and to validate its efficacy in NTI scenar- ios, we conducted a stratified performance analysis employing scale-specific COCO metrics on the PID dataset

Robustness Analysis: To quantitatively substantiate the robustness of the proposed framework against scale variations visualized in Fig. 5 and to validate its efficacy in NTI scenar- ios, we conducted a stratified performance analysis employing scale-specific COCO metrics on the PID dataset. We selected this dataset for the analysis because it simulates t...
[18]

2022 american society of anesthesiologists practice guidelines for management of the difficult airway,

J. L. Apfelbaum, C. A. Hagberg, R. T. Connis, B. B. Abdelmalak, M. Agarkar, R. P. Dutton, J. E. Fiadjoe, R. Greif, P. A. Klock, D. Mercier et al. , “2022 american society of anesthesiologists practice guidelines for management of the difficult airway,” Anesthesiology, vol. 136, no. 1, pp. 31–81, 2022

2022
[19]

Improv- ing foundation model for endoscopy video analysis via representation learning on long sequences,

Z. Wang, C. Liu, L. Zhu, T. Wang, S. Zhang, and Q. Dou, “Improv- ing foundation model for endoscopy video analysis via representation learning on long sequences,” IEEE Journal of Biomedical and Health Informatics, 2025

2025
[20]

Interactive ct- video registration for the continuous guidance of bronchoscopy,

S. A. Merritt, R. Khare, R. Bascom, and W. E. Higgins, “Interactive ct- video registration for the continuous guidance of bronchoscopy,” IEEE Transactions on Medical Imaging , vol. 32, no. 8, pp. 1376–1396, 2013

2013
[21]

Rtmdet: An empirical study of designing real-time object detectors,

C. Lyu, W. Zhang, H. Huang, Y . Zhou, Y . Wang, Y . Liu, S. Zhang, and K. Chen, “Rtmdet: An empirical study of designing real-time object detectors,” 2022

2022
[22]

Analysis of laryngeal high-speed videoendoscopy recordings – roi detection,

T. Ettler and P. Nov ´y, “Analysis of laryngeal high-speed videoendoscopy recordings – roi detection,” Biomedical Signal Processing and Control , vol. 78, p. 103854, 2022. 14 IEEE TRANSACTIONS AND JOURNALS TEMPLATE

2022
[23]

Sim- to-real transfer of soft robotic navigation strategies that learns from the virtual eye-in-hand vision,

J. Lai, T.-A. Ren, W. Yue, S. Su, J. Y . K. Chan, and H. Ren, “Sim- to-real transfer of soft robotic navigation strategies that learns from the virtual eye-in-hand vision,”IEEE Transactions on Industrial Informatics, vol. 20, no. 2, pp. 2365–2377, 2024

2024
[24]

Two step convolutional neural network for automatic glottis localization and segmentation in stroboscopic videos,

V . Belagali, A. Rao M V , P. Gopikishore, R. Krishnamurthy, and P. K. Ghosh, “Two step convolutional neural network for automatic glottis localization and segmentation in stroboscopic videos,”Biomedical Optics Express, vol. 11, no. 8, pp. 4695–4713, 2020

2020
[25]

An original design of remote robot-assisted intubation system,

X. Wang, Y . Tao, X. Tao, J. Chen, Y . Jin, Z. Shan, J. Tan, Q. Cao, and T. Pan, “An original design of remote robot-assisted intubation system,” Scientific Reports, vol. 8, no. 1, 2018

2018
[26]

Automatic endoscopic nav- igation based on attention-based network for nasotracheal intubation,

Z. Deng, X. Wei, X. Zheng, and B. He, “Automatic endoscopic nav- igation based on attention-based network for nasotracheal intubation,” Biomedical Signal Processing and Control , vol. 86, p. 105035, 2023

2023
[27]

Fees-is: Real-time instance segmentation of flexible endoscopic evaluation of swallowing,

W. Weng, X. Zhu, M. Imaizumi, and S. Murono, “Fees-is: Real-time instance segmentation of flexible endoscopic evaluation of swallowing,” in 2023 11th European Workshop on Visual Information Processing (EUVIP), 2023, Conference Proceedings, p. 1–6

2023
[28]

Expert-level aspi- ration and penetration detection during flexible endoscopic evaluation of swallowing with artificial intelligence-assisted diagnosis,

W. Weng, M. Imaizumi, S. Murono, and X. Zhu, “Expert-level aspi- ration and penetration detection during flexible endoscopic evaluation of swallowing with artificial intelligence-assisted diagnosis,” Scientific Reports, vol. 12, no. 1, 2022

2022
[29]

U-net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 , N. Navab, J. Horneg- ger, W. M. Wells, and A. F. Frangi, Eds. Springer International Publishing, 2015, Conference Proceedings, pp. 234–241

2015
[30]

Unet++: A nested u-net architecture for medical image segmentation,

Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Grana...

2018
[31]

Springer, 2018, pp. 3–11

2018
[32]

Unet 3+: A full-scale connected unet for medical image segmentation,

H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y . Iwamoto, X. Han, Y .- W. Chen, and J. Wu, “Unet 3+: A full-scale connected unet for medical image segmentation,” 2020/05 2020

2020
[33]

A dataset of laryngeal endoscopic images with comparative study on convolution neural network-based semantic segmentation,

M.-H. Laves, J. Bicker, L. A. Kahrs, and T. Ortmaier, “A dataset of laryngeal endoscopic images with comparative study on convolution neural network-based semantic segmentation,” International journal of computer assisted radiology and surgery , vol. 14, pp. 483–492, 2019

2019
[34]

Bagls, a multihospital benchmark for automatic glottis segmentation,

P. G ´omez, A. M. Kist, P. Schlegel, D. A. Berry, D. K. Chhetri, S. D ¨urr, M. Echternach, A. M. Johnson, S. Kniesburges, M. Kunduk, Y . Maryn, A. Sch¨utzenberger, M. Verguts, and M. D¨ollinger, “Bagls, a multihospital benchmark for automatic glottis segmentation,” Scientific Data, vol. 7, no. 1, p. 186, 2020

2020
[35]

Mask r-cnn,

K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” in IEEE International Conference on Computer Vision , 2017, pp. 2980–2988

2017
[36]

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 39, no. 6, pp. 1137–1149, 2017

2017
[37]

Cascade r-cnn: High quality object detection and instance segmentation,

Z. Cai and N. Vasconcelos, “Cascade r-cnn: High quality object detection and instance segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1483–1498, 2021

2021
[38]

Conditional convolutions for instance segmentation,

Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” in European Conference on Computer Vision , 2020, pp. 282–298

2020
[39]

Boxinst: High-performance instance segmentation with box annotations,

Z. Tian, C. Shen, X. Wang, and H. Chen, “Boxinst: High-performance instance segmentation with box annotations,” in IEEE Conference on Computer Vision and Pattern Recognition , 2021, pp. 5439–5448

2021
[40]

YOLACT: real-time instance segmentation,

D. Bolya, C. Zhou, F. Xiao, and Y . J. Lee, “YOLACT: real-time instance segmentation,” in IEEE International Conference on Computer Vision , 2019, pp. 9156–9165

2019
[41]

Ultralytics yolov8,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics yolov8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

2023
[42]

Yolov9: Learning what you want to learn us- ing programmable gradient information

C.-Y . Wang and H.-Y . M. Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” arXiv preprint arXiv:2402.13616, 2024

work page arXiv 2024
[43]

Yolov10: Real-time end-to-end object detection.arXiv preprint arXiv:2405.14458, 2024

A. Wang, H. Chen, L. Liu et al., “Yolov10: Real-time end-to-end object detection,” arXiv preprint arXiv:2405.14458 , 2024

work page arXiv 2024
[44]

Ultralytics yolo11,

G. Jocher and J. Qiu, “Ultralytics yolo11,” 2024. [Online]. Available: https://github.com/ultralytics/ultralytics

2024
[45]

End-to-End Object Detection with Transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in European Conference on Computer Vision , 2020, pp. 213–229

2020
[46]

Conditional detr for fast training convergence,

D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y . Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in IEEE/CVF International Conference on Computer Vision , 2021, pp. 3631–3640

2021
[47]

DAB-DETR: Dynamic anchor boxes are better queries for DETR,

S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “DAB-DETR: Dynamic anchor boxes are better queries for DETR,” in International Conference on Learning Representations , 2022, pp. 1–20

2022
[48]

Deformable DETR: Deformable Transformers for End-to-End Object Detection,

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in International Conference on Learning Representations , 2021

2021
[49]

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,

X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 21 002–21 012

2020
[50]

Solov2: Dynamic and fast instance segmentation,

X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “Solov2: Dynamic and fast instance segmentation,” in Neural Information Processing Systems , 2020, pp. 1–17

2020
[51]

Sparse instance activation for real-time instance seg- mentation,

T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse instance activation for real-time instance seg- mentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4423–4432

2022
[52]

Bridging the Gap Be- tween Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection,

S. Zhang, C. Chi, Y . Yao, Z. Lei, and S. Z. Li, “Bridging the Gap Be- tween Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9756–9765

2020
[53]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

2023
[54]

Segment anything in medical images,

J. Ma, Y . He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature communications, vol. 15, no. 1, p. 654, 2024

2024
[55]

Fast segment anything,

X. Zhao, W. Ding, Y . An, Y . Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,” arXiv preprint arXiv:2306.12156 , 2023

work page arXiv 2023
[56]

Rsprompter: Learning to prompt for remote sensing instance seg- mentation based on visual foundation model,

K. Chen, C. Liu, H. Chen, H. Zhang, W. Li, Z. Zou, and Z. Shi, “Rsprompter: Learning to prompt for remote sensing instance seg- mentation based on visual foundation model,” IEEE Transactions on Geoscience and Remote Sensing , vol. 62, pp. 1–17, 2024

2024
[57]

Medical sam 2: Segment medical images as video via segment anything model 2,

J. Zhu, A. Hamdi, Y . Qi, Y . Jin, and J. Wu, “Medical sam 2: Segment medical images as video via segment anything model 2,” arXiv preprint arXiv:2408.00874, 2024

work page arXiv 2024
[58]

Medical sam adapter: Adapting segment anything model for medical image segmentation,

J. Wu, Z. Wang, M. Hong, W. Ji, H. Fu, Y . Xu, M. Xu, and Y . Jin, “Medical sam adapter: Adapting segment anything model for medical image segmentation,”Medical image analysis, vol. 102, p. 103547, 2025

2025
[59]

Understanding convolution for semantic segmentation,

P. Wang, P. Chen, Y . Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” inIEEE Winter Conference on Applications of Computer Vision (WACV) , 2018, pp. 1451–1460

2018
[60]

Pytorch library for cam methods,

J. Gildenblat and contributors, “Pytorch library for cam methods,” https: //github.com/jacobgil/pytorch-grad-cam, 2021

2021
[61]

MMDetection: Open MMLab Detection Toolbox and Benchmark

K. Chen, J. Wang, J. Pang, Y . Cao, Y . Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y . Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, “MMDetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155 , 2019

work page Pith review arXiv 1906
[62]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision , 2014, pp. 740– 755

2014
[63]

Uaal dataset: Upper airway anatomical landmark dataset for automated bronchoscopy and intubation,

R. Hao, Y . Zhang, Z. Tang, Y . Zhou, L. Seenivasan, C. P. L. Chan, J. Y . K. Chan, S. Xu, N. W. Y . Teo, K. Tay, V . Y . J. Tan, J. F. Thong, K. L. Kiong, S. Loh, S. T. Toh, C. M. Lim, and H. Ren, “Uaal dataset: Upper airway anatomical landmark dataset for automated bronchoscopy and intubation,” figshare. Journal contribution., pp. 14 454–14 463, 2024

2024
[64]

Masked-attention mask transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2022, pp. 1280–1289

2022
[65]

Mask scoring R-CNN,

Z. Huang, L. Huang, Y . Gong, C. Huang, and X. Wang, “Mask scoring R-CNN,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6402 – 6411

2019
[66]

Pointrend: Image segmen- tation as rendering,

A. Kirillov, Y . Wu, K. He, and R. Girshick, “Pointrend: Image segmen- tation as rendering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9796–9805

2020
[67]

Instances as queries,

Y . Fang, S. Yang, X. Wang, Y . Li, C. Fang, Y . Shan, B. Feng, and W. Liu, “Instances as queries,” in IEEE International Conference on Computer Vision, 2021, pp. 6910–6919

2021