DeepAAA: clinically applicable and generalizable detection of abdominal aortic aneurysm using deep learning
Pith reviewed 2026-05-25 08:45 UTC · model grok-4.3
The pith
A modified 3D U-Net with ellipse fitting detects abdominal aortic aneurysms on CT scans, with sensitivity/specificity of 0.91/0.95 on internal validation data and 0.85/1.0 on an external test set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the DeepAAA model, built from a modified 3D U-Net combined with ellipse fitting, performs aorta segmentation and AAA detection at high sensitivity and specificity on both an internal validation set and an external test set drawn from different demographics and acquisition settings, while also exceeding literature-reported radiologist performance on incidental AAA detection.
What carries the argument
A modified 3D U-Net combined with ellipse fitting, which together perform aorta segmentation and AAA detection.
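The segmentation-plus-ellipse step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes a binary aorta mask is already available per axial slice, it fits the ellipse by matching second moments (one of several possible fitting methods), it takes the minor axis as the diameter estimate, and it uses a 30 mm cut-off as the AAA threshold. The function names, the pixel spacing argument, and the threshold constant are all hypothetical.

```python
import numpy as np

# Assumption: 3.0 cm is used here as the aneurysm cut-off; the source does
# not state which threshold DeepAAA applies.
AAA_THRESHOLD_MM = 30.0


def fit_ellipse_axes(mask, spacing_mm=(1.0, 1.0)):
    """Return (major, minor) full axis lengths in mm for one 2D binary mask.

    Uses a moment-matched ellipse: for a filled ellipse with semi-axis a,
    the variance along that axis is a**2 / 4, so the full axis length is
    4 * sqrt(variance).
    """
    rows, cols = np.nonzero(mask)
    if rows.size < 3:
        return 0.0, 0.0
    coords = np.stack([rows * spacing_mm[0], cols * spacing_mm[1]])
    cov = np.cov(coords)  # 2x2 covariance of pixel coordinates
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # ascending order
    minor, major = 4.0 * np.sqrt(eigvals)
    return float(major), float(minor)


def max_aortic_diameter(seg_volume, spacing_mm=(1.0, 1.0)):
    """Largest per-slice minor-axis length (mm) over a (slices, H, W) mask."""
    diameters = [fit_ellipse_axes(s, spacing_mm)[1] for s in seg_volume]
    return max(diameters) if diameters else 0.0


def flag_aaa(seg_volume, spacing_mm=(1.0, 1.0), threshold_mm=AAA_THRESHOLD_MM):
    """True when the estimated maximum diameter reaches the threshold."""
    return bool(max_aortic_diameter(seg_volume, spacing_mm) >= threshold_mm)
```

Using the minor axis is a design choice made for this sketch: when an axial slice cuts a tortuous aorta obliquely, the cross-section is elongated and the major axis would overstate the diameter.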
If this is right
- The model works on both contrast and non-contrast CT scans.
- It processes image volumes that contain varying numbers of slices.
- Performance is maintained on data drawn from different patient demographics and acquisition characteristics.
- The system can function as a background detector in routine CT examinations to flag incidental AAAs.
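One common way to let a fixed-input 3D network handle volumes with varying slice counts is to cut the volume into overlapping fixed-depth patches. The sketch below shows that idea only; the source does not say how DeepAAA handles variable depth, and the function name, the patch depth of 32, and the stride of 16 are hypothetical values chosen for illustration.

```python
import numpy as np


def depth_patches(volume, depth=32, stride=16):
    """Yield (start_index, patch) pairs of fixed depth from a (slices, H, W) array.

    Volumes shorter than `depth` are padded by repeating the last slice;
    a final patch is added so the last slices are always covered.
    """
    n = volume.shape[0]
    if n < depth:
        volume = np.pad(volume, ((0, depth - n), (0, 0), (0, 0)), mode="edge")
        n = depth
    starts = list(range(0, n - depth + 1, stride))
    if starts[-1] != n - depth:
        starts.append(n - depth)  # make sure the tail of the volume is included
    for start in starts:
        yield start, volume[start:start + depth]
```

Predictions from overlapping patches would then need to be merged, for example by averaging, before any per-slice measurement is taken.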
Where Pith is reading between the lines
- If integrated into radiology reading workflows, the model could lower the rate at which incidental AAAs are overlooked during interpretation of non-vascular CT studies.
- The segmentation-plus-ellipse approach might be adapted to quantify aneurysm size or growth rate on serial scans without additional retraining.
- Expanding the external test to multi-center data would provide a stronger check on whether performance holds across scanner manufacturers and regional populations.
Load-bearing premise
The 57-examination external test set with differing demographics and acquisition characteristics is assumed to provide a sufficient test of generalizability to real-world clinical use.
What would settle it
A drop in sensitivity or specificity below the reported levels when the same model is run on a substantially larger external set that includes more varied scanner vendors, patient body sizes, or contrast protocols would falsify the generalizability claim.
Original abstract
We propose a deep learning-based technique for detection and quantification of abdominal aortic aneurysms (AAAs). The condition, which leads to more than 10,000 deaths per year in the United States, is asymptomatic, often detected incidentally, and often missed by radiologists. Our model architecture is a modified 3D U-Net combined with ellipse fitting that performs aorta segmentation and AAA detection. The study uses 321 abdominal-pelvic CT examinations performed by Massachusetts General Hospital Department of Radiology for training and validation. The model is then further tested for generalizability on a separate set of 57 examinations with differing patient demographics and acquisition characteristics than the original dataset. DeepAAA achieves high performance on both sets of data (sensitivity/specificity 0.91/0.95 and 0.85 / 1.0 respectively), on contrast and non-contrast CT scans and works with image volumes with varying numbers of images. We find that DeepAAA exceeds literature-reported performance of radiologists on incidental AAA detection. It is expected that the model can serve as an effective background detector in routine CT examinations to prevent incidental AAAs from being missed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepAAA, a modified 3D U-Net architecture combined with ellipse fitting for aorta segmentation and AAA detection/quantification on CT volumes. It reports sensitivity/specificity of 0.91/0.95 on internal data (321 MGH examinations were used for training and validation) and 0.85/1.0 on a separate external set of 57 examinations with differing demographics and acquisition characteristics. The model is stated to operate on both contrast and non-contrast scans with variable slice counts and to exceed literature-reported radiologist performance on incidental AAA detection, with the expectation that it can serve as a background detector in routine practice.
Significance. If the performance numbers hold under more rigorous external validation, the work would be significant for reducing missed incidental AAAs, a condition responsible for over 10,000 U.S. deaths annually. The handling of variable scan protocols is a practical strength. However, the small external cohort limits the strength of the generalizability and clinical-applicability claims.
major comments (2)
- [external test set description] External validation set (57 examinations): the central claim of generalizability to real-world clinical use rests on sensitivity 0.85 / specificity 1.0 on this cohort. With n=57 the binomial confidence intervals are necessarily wide and the set cannot capture the full range of scanner vendors, slice thicknesses, contrast protocols, or incidental AAA prevalence encountered in routine practice.
- [results and discussion of clinical comparison] Radiologist benchmark comparison: the claim that DeepAAA exceeds literature-reported radiologist performance on incidental AAA detection rests on a comparison that was not performed on the same cases, so the superiority statement cannot be directly evaluated.
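To make the first major comment concrete, the width of a binomial interval at this sample size can be sketched with a Wilson score interval. The per-class counts below are hypothetical: the source gives only the total of 57 examinations and the rates 0.85 and 1.0, so 17 of 20 positives and 37 of 37 negatives are assumed purely because they reproduce those rates. The function name is also an illustration, not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical split of the 57 external examinations (not from the source).
sens_lo, sens_hi = wilson_interval(17, 20)  # sensitivity 0.85
spec_lo, spec_hi = wilson_interval(37, 37)  # specificity 1.0
print(f"sensitivity 95% CI: {sens_lo:.2f} to {sens_hi:.2f}")
print(f"specificity 95% CI: {spec_lo:.2f} to {spec_hi:.2f}")
```

Under these assumed counts the sensitivity interval spans roughly 0.64 to 0.95, which illustrates why the referee treats the external result as an initial check rather than a settled estimate.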
minor comments (1)
- [abstract] The abstract supplies no information on the internal training/validation split, loss functions, or statistical testing; these details should be summarized even if fully described in the methods.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, acknowledging limitations where they exist and indicating revisions to the manuscript.
Point-by-point responses
- Referee: External validation set (57 examinations): the central claim of generalizability to real-world clinical use rests on sensitivity 0.85 / specificity 1.0 on this cohort. With n=57 the binomial confidence intervals are necessarily wide and the set cannot capture the full range of scanner vendors, slice thicknesses, contrast protocols, or incidental AAA prevalence encountered in routine practice.
Authors: We agree that the external cohort size of 57 limits statistical precision and the breadth of variability captured. This is an inherent constraint of the available data. We will add binomial confidence intervals to the performance metrics in the results section and expand the discussion to explicitly qualify the generalizability claims, noting that further validation on larger, more diverse cohorts is needed. The external set was selected specifically for demographic and technical differences from the internal data, providing an initial test of robustness across contrast/non-contrast and variable slice counts.
Revision: partial
- Referee: Radiologist benchmark comparison: the claim that DeepAAA exceeds literature-reported radiologist performance on incidental AAA detection rests on a comparison that was not performed on the same cases, so the superiority statement cannot be directly evaluated.
Authors: We agree that the comparison relies on literature-reported values rather than the same examinations, precluding a direct head-to-head evaluation. We will revise the abstract, results, and discussion to state that DeepAAA performance exceeds previously reported radiologist detection rates for incidental AAAs in the literature, removing any implication of superiority on identical cases.
Revision: yes
Circularity Check
No significant circularity; performance metrics derived from independent held-out evaluations
Full rationale
The paper trains and validates a modified 3D U-Net on 321 examinations and reports sensitivity/specificity on internal validation data and on a separate external test set of 57 examinations. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present that reduce the reported results to the training inputs by construction. Evaluation uses held-out data with differing characteristics, satisfying the condition for a self-contained derivation against external benchmarks.