DeepAAA: clinically applicable and generalizable detection of abdominal aortic aneurysm using deep learning

Brian Ghoshhajra; Gopal Kotecha; Jen-Tang Lu; Jin Chen; Joel Pinto; Katherine P. Andriole; Mark Michalski; Neil A. Tenenholtz; Paul Vozila; Rupert Brooks

arxiv: 1907.02567 · v1 · pith:B56HFZGFnew · submitted 2019-07-04 · 📡 eess.IV · cs.CV

Jen-Tang Lu, Rupert Brooks, Stefan Hahn, Jin Chen, Varun Buch, Gopal Kotecha, Katherine P. Andriole, Brian Ghoshhajra, Joel Pinto, Paul Vozila, Mark Michalski, Neil A. Tenenholtz

Pith reviewed 2026-05-25 08:45 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords abdominal aortic aneurysm · deep learning · CT detection · 3D U-Net · aorta segmentation · incidental findings · medical imaging

The pith

A modified 3D U-Net with ellipse fitting detects abdominal aortic aneurysms on CT scans with sensitivity 0.91 and specificity 0.95 internally and 0.85/1.0 externally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a deep learning system can detect and quantify abdominal aortic aneurysms in routine abdominal-pelvic CT examinations, including cases that are asymptomatic and frequently missed as incidental findings. The model is trained and validated on 321 examinations from one hospital and then evaluated on a separate 57-examination set that differs in patient demographics and scan acquisition. It achieves the reported performance levels on both contrast-enhanced and non-contrast scans and on volumes containing different numbers of slices. The authors note that these results exceed literature-reported radiologist performance for incidental AAA detection and position the model as a potential background tool to reduce missed cases that contribute to more than 10,000 US deaths per year.

Core claim

The central claim is that the DeepAAA model, built from a modified 3D U-Net combined with ellipse fitting, performs aorta segmentation and AAA detection at high sensitivity and specificity on both an internal validation set and an external test set drawn from different demographics and acquisition settings, while also exceeding literature-reported radiologist performance on incidental AAA detection.

What carries the argument

Modified 3D U-Net combined with ellipse fitting that performs aorta segmentation and AAA detection
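
As a rough illustration of the ellipse-fitting step, the sketch below fits a moment-matched ellipse to a single 2D binary aorta mask and thresholds the resulting diameter. This is an assumed, simplified stand-in, not the authors' procedure: the function names `fit_ellipse_axes_mm` and `is_aneurysmal`, the moment-based fit, the pixel spacing, and the 30 mm cut-off (a commonly used clinical convention for AAA) are all assumptions, since the abstract does not describe how the ellipse is fitted or which axis is measured.

```python
import numpy as np


def fit_ellipse_axes_mm(mask, spacing_mm=(1.0, 1.0)):
    """Return (major, minor) full axis lengths in mm of a moment-matched ellipse.

    Hypothetical helper: matches the second moments of the foreground pixels
    to those of a uniformly filled ellipse, whose variance along a semi-axis
    of length a is a**2 / 4 (so the full axis length is 4 * sqrt(variance)).
    """
    rows, cols = np.nonzero(mask)
    if rows.size < 3:
        raise ValueError("mask has too few foreground pixels to fit an ellipse")
    # Convert pixel indices to physical coordinates in millimetres.
    pts = np.stack([rows * spacing_mm[0], cols * spacing_mm[1]], axis=0)
    cov = np.cov(pts)                   # 2x2 covariance of the coordinates
    eigvals = np.linalg.eigvalsh(cov)   # eigenvalues in ascending order
    minor, major = 4.0 * np.sqrt(eigvals)
    return float(major), float(minor)


def is_aneurysmal(diameter_mm, threshold_mm=30.0):
    """Flag a diameter at or above the assumed 30 mm convention."""
    return diameter_mm >= threshold_mm


# Synthetic test mask: a filled ellipse with semi-axes of 20 and 12 pixels.
yy, xx = np.mgrid[0:101, 0:101]
mask = ((yy - 50) / 20.0) ** 2 + ((xx - 50) / 12.0) ** 2 <= 1.0

major_mm, minor_mm = fit_ellipse_axes_mm(mask, spacing_mm=(1.0, 1.0))
print(f"major axis: {major_mm:.1f} mm, minor axis: {minor_mm:.1f} mm")
print("flag on major axis:", is_aneurysmal(major_mm))
```

Which axis to threshold is a design choice the source does not state; on oblique slices the major axis can overestimate the true vessel diameter, so a real pipeline would need to decide this explicitly.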

If this is right

The model works on both contrast and non-contrast CT scans.
It processes image volumes that contain varying numbers of slices.
Performance is maintained on data drawn from different patient demographics and acquisition characteristics.
The system can function as a background detector in routine CT examinations to flag incidental AAAs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If integrated into radiology reading workflows, the model could lower the rate at which incidental AAAs are overlooked during interpretation of non-vascular CT studies.
The segmentation-plus-ellipse approach might be adapted to quantify aneurysm size or growth rate on serial scans without additional retraining.
Expanding the external test to multi-center data would provide a stronger check on whether performance holds across scanner manufacturers and regional populations.

Load-bearing premise

The 57-examination external test set with differing demographics and acquisition characteristics is assumed to provide a sufficient test of generalizability to real-world clinical use.

What would settle it

A drop in sensitivity or specificity below the reported levels when the same model is run on a substantially larger external set that includes more varied scanner vendors, patient body sizes, or contrast protocols would falsify the generalizability claim.
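
One way to make this falsification test concrete is an exact one-sided binomial test against the reported external sensitivity of 0.85. The sketch below is illustrative only: the function name `binom_lower_tail` and the counts (200 AAA-positive examinations, 150 detected) are hypothetical and do not come from the paper.

```python
from math import comb


def binom_lower_tail(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Reported external sensitivity, used as the null hypothesis.
reported_sensitivity = 0.85

# HYPOTHETICAL larger external cohort: counts invented for illustration.
n_positive = 200   # AAA-positive examinations
n_detected = 150   # of which the model flags this many

# Probability of seeing this few detections if the true sensitivity were 0.85.
p_value = binom_lower_tail(n_detected, n_positive, reported_sensitivity)
print(f"observed sensitivity: {n_detected / n_positive:.2f}")
print(f"one-sided p-value vs. 0.85: {p_value:.2e}")
```

A small p-value would indicate that the observed sensitivity is lower than 0.85 by more than sampling noise explains; the same function applies to specificity by counting true negatives among AAA-negative examinations.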

Original abstract

We propose a deep learning-based technique for detection and quantification of abdominal aortic aneurysms (AAAs). The condition, which leads to more than 10,000 deaths per year in the United States, is asymptomatic, often detected incidentally, and often missed by radiologists. Our model architecture is a modified 3D U-Net combined with ellipse fitting that performs aorta segmentation and AAA detection. The study uses 321 abdominal-pelvic CT examinations performed by Massachusetts General Hospital Department of Radiology for training and validation. The model is then further tested for generalizability on a separate set of 57 examinations with differing patient demographics and acquisition characteristics than the original dataset. DeepAAA achieves high performance on both sets of data (sensitivity/specificity 0.91/0.95 and 0.85 / 1.0 respectively), on contrast and non-contrast CT scans and works with image volumes with varying numbers of images. We find that DeepAAA exceeds literature-reported performance of radiologists on incidental AAA detection. It is expected that the model can serve as an effective background detector in routine CT examinations to prevent incidental AAAs from being missed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Small external set of 57 exams undercuts the generalizability claim even though the application itself is straightforward.

The main point is that this is a modified 3D U-Net plus ellipse fitting applied to AAA segmentation and detection on CT. They train and validate internally on 321 MGH exams and then test on a separate 57-exam set with different demographics and scan parameters, reporting sensitivity/specificity of 0.91/0.95 internally and 0.85/1.0 externally. The model runs on both contrast and non-contrast volumes and handles varying slice counts. They also note it beats published radiologist rates for incidental detection. That external test and the practical workflow angle are the concrete new pieces here. The architecture itself is not novel. The work is useful for showing that an established model can be made to run on real hospital data with some degree of robustness to contrast and volume size.

The soft spot is exactly the one flagged in the stress test. Fifty-seven exams is too small to support broad claims of generalizability; the confidence intervals around those point estimates will be wide, and the set cannot capture the range of vendors, protocols, and prevalence seen in routine practice. The radiologist comparison is also to literature numbers rather than to readings of the same cases, so the superiority statement rests on weaker ground. The abstract gives almost no information on splits, loss, or statistical testing, which makes the numbers harder to evaluate.

This paper is aimed at radiologists and medical imaging groups who want a deployable background detector for missed AAAs. Readers working on clinical translation would get value from the external numbers and the handling of mixed contrast scans. It deserves peer review because the clinical problem matters and they did attempt external validation, even if the set is limited. Reviewers will likely ask for more data or tighter statistical reporting, but the work is coherent enough to go out rather than be desk-rejected.

Referee Report

2 major / 1 minor

Summary. The paper introduces DeepAAA, a modified 3D U-Net architecture combined with ellipse fitting for aorta segmentation and AAA detection/quantification on CT volumes. It reports sensitivity/specificity of 0.91/0.95 on an internal set of 321 MGH examinations and 0.85/1.0 on a separate external set of 57 examinations with differing demographics and acquisition characteristics. The model is stated to operate on both contrast and non-contrast scans with variable slice counts and to exceed literature-reported radiologist performance on incidental AAA detection, with the expectation that it can serve as a background detector in routine practice.

Significance. If the performance numbers hold under more rigorous external validation, the work would be significant for reducing missed incidental AAAs, a condition responsible for over 10,000 U.S. deaths annually. The handling of variable scan protocols is a practical strength. However, the small external cohort limits the strength of the generalizability and clinical-applicability claims.

major comments (2)

[external test set description] External validation set (57 examinations): the central claim of generalizability to real-world clinical use rests on sensitivity 0.85 / specificity 1.0 on this cohort. With n=57 the binomial confidence intervals are necessarily wide and the set cannot capture the full range of scanner vendors, slice thicknesses, contrast protocols, or incidental AAA prevalence encountered in routine practice.
[results and discussion of clinical comparison] Radiologist benchmark comparison: the claim that DeepAAA exceeds literature-reported radiologist performance on incidental AAA detection rests on a comparison that was not performed on the same cases, so the superiority statement cannot be directly evaluated.
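
To show how wide the intervals in the first comment can be, the sketch below computes 95% Wilson score intervals for the external-set point estimates. The abstract does not report how the 57 examinations split into AAA-positive and AAA-negative cases, so the split used here (20 positive, 37 negative) is purely hypothetical and chosen only so that the point estimates reproduce 0.85 and 1.0; `wilson_interval` is an illustrative helper name.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# HYPOTHETICAL split of the 57 external examinations (not from the paper).
true_positives, positives = 17, 20   # sensitivity 17/20 = 0.85
true_negatives, negatives = 37, 37   # specificity 37/37 = 1.0

sens_lo, sens_hi = wilson_interval(true_positives, positives)
spec_lo, spec_hi = wilson_interval(true_negatives, negatives)

print(f"sensitivity 0.85, 95% CI: ({sens_lo:.2f}, {sens_hi:.2f})")
print(f"specificity 1.00, 95% CI: ({spec_lo:.2f}, {spec_hi:.2f})")
```

The Wilson interval is used here instead of the simpler normal approximation because the latter collapses to zero width when the observed proportion is exactly 1.0, as it is for the external specificity.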

minor comments (1)

[abstract] The abstract supplies no information on the internal training/validation split, loss functions, or statistical testing; these details should be summarized even if fully described in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, acknowledging limitations where they exist and indicating revisions to the manuscript.

Referee: External validation set (57 examinations): the central claim of generalizability to real-world clinical use rests on sensitivity 0.85 / specificity 1.0 on this cohort. With n=57 the binomial confidence intervals are necessarily wide and the set cannot capture the full range of scanner vendors, slice thicknesses, contrast protocols, or incidental AAA prevalence encountered in routine practice.

Authors: We agree that the external cohort size of 57 limits statistical precision and the breadth of variability captured. This is an inherent constraint of the available data. We will add binomial confidence intervals to the performance metrics in the results section and expand the discussion to explicitly qualify the generalizability claims, noting that further validation on larger, more diverse cohorts is needed. The external set was selected specifically for demographic and technical differences from the internal data, providing an initial test of robustness across contrast/non-contrast and variable slice counts. revision: partial

Referee: Radiologist benchmark comparison: the claim that DeepAAA exceeds literature-reported radiologist performance on incidental AAA detection rests on a comparison that was not performed on the same cases, so the superiority statement cannot be directly evaluated.

Authors: We agree that the comparison relies on literature-reported values rather than the same examinations, precluding a direct head-to-head evaluation. We will revise the abstract, results, and discussion to state that DeepAAA performance exceeds previously reported radiologist detection rates for incidental AAAs in the literature, removing any implication of superiority on identical cases. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance metrics derived from independent held-out evaluations

The paper trains and validates a modified 3D U-Net on 321 examinations and reports sensitivity/specificity on the internal validation data and on a separate external test set of 57 examinations. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present that reduce the reported results to the training inputs by construction. Evaluation uses held-out data with differing characteristics, satisfying the condition for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; model training implicitly relies on standard deep-learning assumptions not enumerated here.

pith-pipeline@v0.9.0 · 5775 in / 1061 out tokens · 30961 ms · 2026-05-25T08:45:49.337604+00:00 · methodology
