pith. machine review for the scientific record.

arxiv: 2605.09002 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI

Recognition: no theorem link

CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords abdominal CT · quantitative phenotypes · organ segmentation · disease classification · interpretable models · logistic regression · external validation

The pith

Quantitative phenotypes extracted from organ segmentations in abdominal CT scans classify multiple diseases with AUCs matching or exceeding a vision transformer baseline while remaining inspectable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that turns automated multi-organ segmentations into more than 900 numerical descriptors of shape, density, and spatial relationships. These descriptors feed into sparse logistic regression models trained on one large dataset and tested on two others without retraining. The resulting classifiers achieve higher macro-AUC than a strong vision-transformer baseline on every external set. A reader would care because the approach replaces opaque pixel patterns with concrete, auditable measurements that clinicians could trace back to specific organs and tissue properties.

Core claim

CT-IDP generates organ- and compartment-level descriptors spanning morphometry, attenuation, and contextual burden from TotalSegmentator segmentations, then applies elastic-net regularized logistic regression under a frozen specification to produce disease-specific predictions. On the MERLIN benchmark the method records a macro-AUC of 0.897 versus 0.880 for the vision-transformer baseline; the same frozen model yields 0.877 versus 0.857 on Duke-Abdomen and 0.780 versus 0.756 on AMOS. Coefficient inspection and phenotype-stratified audits confirm that the performance edge arises from explicit, human-readable features rather than learned embeddings.
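The macro-AUC figures quoted above are unweighted means of per-disease AUCs. A minimal sketch of that computation on toy multi-label data (synthetic labels and scores, not the paper's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multi-label setup: 3 hypothetical disease labels, 8 studies.
# y_true[i, d] = 1 if study i carries disease d; y_score holds model probabilities.
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
])
y_score = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.1, 0.7],
    [0.8, 0.7, 0.3],
    [0.2, 0.1, 0.2],
    [0.7, 0.3, 0.9],
    [0.75, 0.6, 0.8],
    [0.1, 0.2, 0.1],
])

# Macro-AUC: compute AUC per disease, then take the unweighted mean,
# so rare labels weigh as much as common ones.
per_disease = [roc_auc_score(y_true[:, d], y_score[:, d]) for d in range(3)]
macro_auc = float(np.mean(per_disease))
print(round(macro_auc, 3))  # → 0.978
```

The macro average is what lets a single number summarize performance across all disease labels at once, which is how the 0.897-vs-0.880 comparison above should be read.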

What carries the argument

CT-IDP, the pipeline that converts multi-organ segmentations into sparse, elastic-net logistic regression models whose coefficients directly indicate the contribution of each measurable phenotype to disease probability.
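The paper's actual phenotype table and hyperparameters are not reproduced here; as a hedged sketch of the modeling step, an elastic-net logistic model of this kind can be fit and its coefficients inspected with scikit-learn (synthetic data; the feature names are illustrative, not the paper's phenotype list):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a phenotype table: 300 studies x 6 descriptors.
names = ["liver_volume_ml", "liver_mean_hu", "spleen_volume_ml",
         "kidney_mean_hu", "ascites_fraction", "aorta_calc_burden"]
X = rng.normal(size=(300, 6))
# Simulated label driven by two features, so a sparse model should find them.
y = (1.2 * X[:, 0] - 1.5 * X[:, 4] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Elastic-net penalty (saga solver); l1_ratio trades off L1 sparsity vs L2 shrinkage.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.5, max_iter=5000).fit(X, y)

# Coefficient inspection: the nonzero weights name the phenotypes that carry
# the prediction. A "frozen specification" means these weights are fixed
# before any external dataset is scored.
for name, w in sorted(zip(names, clf.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name:>20s}  {w:+.3f}")
```

This is the sense in which each coefficient "directly indicates the contribution of each measurable phenotype": the model is a weighted sum of named measurements, not an embedding.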

Load-bearing premise

Automated segmentations remain accurate and unbiased enough across institutions that the derived numerical phenotypes retain their disease-discriminating power without systematic distortion.

What would settle it

Re-run the identical frozen models on a new multi-center cohort with expert manual segmentations substituted for the automated ones; if performance falls below the vision-transformer baseline even then, the phenotypes themselves, rather than segmentation errors, would be shown to be unreliable.

read the original abstract

In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets: 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ- and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on the Duke-Abdomen dataset, and 0.780 versus 0.756 on AMOS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CT-IDP, a framework deriving over 900 quantitative phenotypes (morphometry, attenuation, and burden descriptors) from TotalSegmentator multi-organ segmentations of abdominal CTs. Sparse elastic-net logistic regression models are trained on the MERLIN dataset (15,175/5,018/5,082 train/val/test studies) for disease classification and externally validated under frozen specification on Duke-Abdomen (2,000 studies) and AMOS (1,107 studies). It reports macro-AUC improvements over a DINOv3 vision-transformer baseline (0.897 vs 0.880 on MERLIN; 0.877 vs 0.857 on Duke; 0.780 vs 0.756 on AMOS), supported by phenotype-stratified audits and coefficient inspection for interpretability.

Significance. If the segmentation-derived phenotypes remain disease-discriminative after accounting for tool errors, the work supplies a reproducible, interpretable alternative to end-to-end deep models for multi-institutional CT classification. External validation with frozen models, plus explicit phenotype audits, strengthens reproducibility and offers a path toward clinically auditable predictions; the modest but consistent AUC gains across three datasets indicate practical utility if bias is ruled out.

major comments (2)
  1. [Abstract and Methods] Abstract and Methods: All 900+ phenotypes are derived directly from TotalSegmentator outputs, yet no per-organ Dice, Hausdorff, or volume-error metrics are reported on the diseased subsets of MERLIN, Duke-Abdomen, or AMOS. Pathologies (tumors, ascites, inflammation) routinely distort boundaries and densities; without cohort-specific validation, it is unclear whether the modest AUC gains (e.g., +0.017 on MERLIN) reflect true signal or systematic segmentation bias propagated into the elastic-net models.
  2. [Results] Results and phenotype definition: The manuscript states that full phenotype definitions and any post-hoc selection or filtering steps are provided, but these are not visible in the supplied description; without an exhaustive, reproducible list (including exact formulas for attenuation statistics and burden ratios), independent replication and assessment of potential circularity in phenotype construction cannot be performed.
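The per-organ Dice overlap the referee asks for in the first major comment is a one-line computation once both masks exist; a minimal sketch on synthetic masks (not the study's data):

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two boolean masks: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Synthetic "automated" vs "manual" organ masks on a small grid.
pred = np.zeros((16, 16), dtype=bool)
pred[4:12, 4:12] = True      # 64 voxels
manual = np.zeros((16, 16), dtype=bool)
manual[6:14, 4:12] = True    # 64 voxels, shifted two rows

print(round(dice(pred, manual), 3))  # overlap 6x8 = 48 → 2*48/128 = 0.75
```

Reporting this per organ on the diseased subsets is exactly the cohort-specific validation the comment says is missing.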
minor comments (2)
  1. [Abstract] The abstract should explicitly list the disease labels and number of classes underlying the macro-AUC computation to allow immediate assessment of task difficulty.
  2. Tables or supplementary material reporting the top phenotype coefficients per disease would benefit from standardized formatting and confidence intervals to facilitate direct comparison with the DINOv3 baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important aspects of reproducibility and validation. We provide point-by-point responses below and will revise the manuscript accordingly to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract and Methods] Abstract and Methods: All 900+ phenotypes are derived directly from TotalSegmentator outputs, yet no per-organ Dice, Hausdorff, or volume-error metrics are reported on the diseased subsets of MERLIN, Duke-Abdomen, or AMOS. Pathologies (tumors, ascites, inflammation) routinely distort boundaries and densities; without cohort-specific validation, it is unclear whether the modest AUC gains (e.g., +0.017 on MERLIN) reflect true signal or systematic segmentation bias propagated into the elastic-net models.

    Authors: We appreciate the referee's emphasis on potential segmentation inaccuracies in pathological cases. TotalSegmentator has been validated on diverse CT datasets including pathologies in its original work and follow-up studies, but we acknowledge that explicit per-organ Dice, Hausdorff, and volume-error metrics on the diseased subsets of our specific cohorts are not reported in the current manuscript. This is a valid limitation that could affect interpretation of the modest AUC improvements. In the revised version, we will add a new subsection in Methods discussing segmentation performance expectations based on published benchmarks, along with a small-scale manual audit of segmentation quality on a random sample of diseased cases from MERLIN. We will also expand the Discussion to address how any residual errors might influence phenotype derivation and model performance. While the consistent gains across external datasets and the use of sparse, interpretable models provide some reassurance against systematic bias, we agree these additions will strengthen the manuscript. revision: yes

  2. Referee: [Results] Results and phenotype definition: The manuscript states that full phenotype definitions and any post-hoc selection or filtering steps are provided, but these are not visible in the supplied description; without an exhaustive, reproducible list (including exact formulas for attenuation statistics and burden ratios), independent replication and assessment of potential circularity in phenotype construction cannot be performed.

    Authors: We apologize for the lack of immediate visibility of the full phenotype details in the review materials. The exhaustive list of over 900 phenotypes—including exact formulas for morphometry (e.g., volumes, surface areas), attenuation statistics (mean, standard deviation, percentiles of Hounsfield units within each organ mask), and burden ratios (e.g., compartment involvement fractions)—along with all post-hoc filtering steps, is provided in the Supplementary Materials and the linked public code repository. Phenotypes are constructed solely from segmentation outputs without any use of disease labels, avoiding circularity. To improve accessibility, we will revise the Methods section to include a summary table of phenotype categories with representative formulas and ensure the supplementary file is explicitly referenced in the main text. This will facilitate independent replication and allow direct assessment of the phenotype construction process. revision: yes
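The attenuation statistics the rebuttal describes (mean, standard deviation, and percentiles of Hounsfield units within an organ mask, plus a volume descriptor) are straightforward to compute once a segmentation exists. A minimal sketch on a synthetic volume, not the authors' code; the mask, HU distribution, and voxel spacing are all made up:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic CT volume in Hounsfield units and a fake "liver" mask.
volume_hu = rng.normal(loc=0.0, scale=150.0, size=(32, 64, 64))
mask = np.zeros(volume_hu.shape, dtype=bool)
mask[8:24, 16:48, 16:48] = True
volume_hu[mask] = rng.normal(loc=55.0, scale=15.0, size=mask.sum())  # liver-like HU

# Slice thickness x in-plane spacing (mm^3), converted to ml.
voxel_volume_ml = (1.5 * 0.8 * 0.8) / 1000.0

def attenuation_phenotypes(vol, organ_mask, voxel_ml):
    """Morphometry + attenuation descriptors for one organ mask."""
    vals = vol[organ_mask]
    return {
        "volume_ml": float(organ_mask.sum() * voxel_ml),
        "hu_mean": float(vals.mean()),
        "hu_std": float(vals.std()),
        "hu_p05": float(np.percentile(vals, 5)),
        "hu_p95": float(np.percentile(vals, 95)),
    }

phenos = attenuation_phenotypes(volume_hu, mask, voxel_volume_ml)
print(phenos)
```

Because every descriptor is a function of the image and the mask alone, no disease label enters the construction, which is the rebuttal's argument against circularity.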

Circularity Check

0 steps flagged

No significant circularity: external validation and phenotype derivation fixed before training keep the claims independent

full rationale

The paper generates multi-organ segmentations via TotalSegmentator, derives >900 phenotypes (morphometry, attenuation, burden) from those outputs, trains elastic-net logistic regression on the MERLIN training split, and reports AUC on held-out MERLIN test plus two fully external datasets (Duke-Abdomen, AMOS). No equation or claim reduces a reported performance number to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled via prior work. The central results are therefore falsifiable on independent data and do not collapse to the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the accuracy of an off-the-shelf segmentation tool and the assumption that linear combinations of derived phenotypes suffice for disease discrimination; no new physical entities are postulated.

free parameters (1)
  • elastic-net regularization parameters
    Alpha and l1-ratio hyperparameters selected on the training split to produce sparse models.
axioms (1)
  • domain assumption TotalSegmentator produces sufficiently accurate multi-organ segmentations for phenotype derivation
    Invoked when generating the 900+ descriptors without reported segmentation quality metrics on the study datasets.

pith-pipeline@v0.9.0 · 5508 in / 1255 out tokens · 61019 ms · 2026-05-12T02:44:30.950862+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

cs.CV · 2026-05 · unverdicted · novelty 6.0

    JANUS conditions Vision Transformer embeddings on macro-radiomic priors via anatomically guided gating, reaching macro-AUROC 0.88 on an internal test set of 5082 cases and 0.87 on an external set of 2000 cases while i...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    JAMA Intern Med

    Smith-Bindman, R., et al., Projected lifetime cancer risks from current computed tomography imaging. JAMA Intern Med, 2025

  2. [2]

    Are we overdoing it? Changes in diagnostic imaging workload during the years 2010–2020 including the impact of the SARS-CoV-2 pandemic

    Winder, M., et al. Are we overdoing it? Changes in diagnostic imaging workload during the years 2010–2020 including the impact of the SARS-CoV-2 pandemic. in Healthcare

  3. [3]

    European radiology, 2025

    Momin, E., et al., Systematic review on the impact of deep learning-driven worklist triage on radiology workflow and clinical outcomes. European radiology, 2025. 35(11): p. 6879–6893

  4. [4]

    Medical image analysis, 2021

    Draelos, R.L., et al., Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes. Medical image analysis, 2021. 67: p. 101857

  5. [5]

    Radiology: Artificial Intelligence, 2021

    Tushar, F.I., et al., Classification of multiple diseases on body CT scans using weakly supervised deep learning. Radiology: Artificial Intelligence, 2021. 4(1): p. e210026

  6. [6]

    medRxiv, 2025

    Beeche, C., et al., A Pan-Organ Vision-Language Model for Generalizable 3D CT Representations. medRxiv, 2025

  7. [7]

    arXiv preprint arXiv:2511.17803 (2025)

    Agrawal, K.K., et al., Pillar-0: A new frontier for radiology foundation models. arXiv preprint arXiv:2511.17803, 2025

  8. [8]

    ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

    Geirhos, R., et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. in International conference on learning representations. 2018

  9. [9]

    Nature Machine Intelligence, 2020

    Geirhos, R., et al., Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020. 2(11): p. 665–673

  10. [10]

    Academic radiology, 2012

    Linguraru, M.G., et al., Assessing hepatomegaly: automated volumetric analysis of the liver. Academic radiology, 2012. 19(5): p. 588–598

  11. [11]

    Gücük, A. and U. Üyetürk, Usefulness of hounsfield unit and density in the assessment and treatment of urinary stones. World journal of nephrology, 2014. 3(4): p. 282

  12. [12]

    Cancers, 2023

    Paudyal, R., et al., Artificial intelligence in CT and MR imaging for oncological applications. Cancers, 2023. 15(9): p. 2573

  13. [13]

    Nature reviews Clinical oncology, 2017

    Lambin, P., et al., Radiomics: the bridge between medical imaging and personalized medicine. Nature reviews Clinical oncology, 2017. 14(12): p. 749–762

  14. [14]

    Journal of Nuclear Medicine, 2020

    Mayerhoefer, M.E., et al., Introduction to radiomics. Journal of Nuclear Medicine, 2020. 61(4): p. 488–495

  15. [15]

    Clinical pharmacology & therapeutics, 2001

    Group, B.D.W., et al., Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clinical pharmacology & therapeutics, 2001. 69(3): p. 89–95

  16. [16]

    Radiology, 2015

    Sullivan, D.C., et al., Metrology standards for quantitative imaging biomarkers. Radiology, 2015. 277(3): p. 813–825

  17. [17]

    Radiology, 2020

    Zwanenburg, A., et al., The image biomarker standardization initiative: standardized quantitative radiomics for high-throughput image-based phenotyping. Radiology, 2020. 295(2): p. 328–338

  18. [18]

    Radiology: Artificial Intelligence, 2023

    Wasserthal, J., et al., TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence, 2023. 5(5): p. e230024

  19. [19]

    Medical Image Analysis, 2025

    Dahal, L., et al., XCAT 3.0: A comprehensive library of personalized digital twins derived from CT scans. Medical Image Analysis, 2025. 103: p. 103636

  20. [20]

    Nature methods, 2021

    Isensee, F., et al., nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 2021. 18(2): p. 203–211

  21. [21]

    arXiv preprint arXiv:2511.11450, 2025

    Rokuss, M., et al., Voxtell: Free-text promptable universal 3d medical image segmentation. arXiv preprint arXiv:2511.11450, 2025

  22. [22]

    Research Square, 2024

    Blankemeier, L., et al., Merlin: A vision language foundation model for 3d computed tomography. Research Square, 2024: rs.3.rs-4546309

  23. [23]

    Advances in neural information processing systems, 2022

    Ji, Y., et al., Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. Advances in neural information processing systems, 2022. 35: p. 36722–36732

  24. [24]

    DINOv3

    Siméoni, O., et al., Dinov3. arXiv preprint arXiv:2508.10104, 2025

  25. [25]

    MedGemma Technical Report

    Sellergren, A., et al., Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  26. [26]

    Qwen3 Technical Report

    Yang, A., et al., Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  27. [27]

    Zou, H. and T. Hastie, Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2005. 67(2): p. 301–320

  28. [28]

    Journal of Medical Imaging, 2020

    Abadi, E., et al., Virtual clinical trials in medical imaging: a review. Journal of Medical Imaging, 2020. 7(4): p. 042805–042805

  29. [29]

    Medical Image Analysis, 2025

    Tushar, F.I., et al., Virtual lung screening trial (VLST): An in silico study inspired by the national lung screening trial for lung cancer detection. Medical Image Analysis, 2025. 103: p. 103576