arxiv: 2604.24146 · v1 · submitted 2026-04-27 · 💻 cs.CV

EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT

Xuguang Bai , Mingxuan Liu , Tongxi Song , Yifei Chen , Hongjia Yang , Kasidit Anmahapong , Zihan Li , Ying Zhou

show 1 more author

Qiyuan Tian

This is my paper

Pith reviewed 2026-05-08 04:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D chest CTanomaly detectionfoundation modelweak supervisionexplainable AImedical visionradiology reportsvoxel-level analysis

0 comments p. Extension

The pith

EXACT learns organ-specific voxel anomaly scores for 3D chest CT from scan-report pairs using weak supervision to enable more accurate and interpretable analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EXACT, a vision foundation model designed for three-dimensional chest CT analysis that focuses on explainability and anomaly awareness. It pre-trains on thousands of paired CT scans and radiology reports by applying anatomy-aware weak supervision to learn organ segmentation alongside anomaly localization at the voxel scale. This produces maps that score each voxel for disease-specific anomalies while respecting anatomical boundaries. Such a system matters because volumetric CT data is complex and current AI often lacks spatial detail or transparency, which can hinder clinical trust and utility. The reported evaluations across multiple countries and centers indicate consistent outperformance on tasks including disease diagnosis, anomaly localization without training examples, model adaptation, and generating reports backed by visual evidence.

Core claim

EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision to jointly learn organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-

What carries the argument

Anatomy-aware weak supervision from free-text radiology reports enabling joint organ segmentation and multi-instance anomaly localization to generate confined voxel-level disease anomaly scores.

Load-bearing premise

Supervision derived from free-text radiology reports is detailed and accurate enough to train models that produce reliable voxel-level anomaly scores generalizable across centers and disease distributions.

What would settle it

If a comparison on a dataset with expert-annotated voxel masks reveals that the predicted anomaly scores do not align well with the actual lesion locations or if the model shows poor transfer to new disease types not represented in training reports.

read the original abstract

Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-reports pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EXACT claims to deliver organ-specific anomaly maps for 3D chest CT from report supervision alone, but the abstract supplies no metrics or ablations so the performance gains cannot be checked.

read the letter

The main takeaway is that this work introduces EXACT, a 3D CT foundation model that jointly produces organ segmentation and disease-specific voxel anomaly maps using anatomy-aware weak supervision on 25k scan-report pairs, without any manual voxel labels. The idea of confining anomaly scores to individual organs while learning from free-text reports is a reasonable step beyond standard global image-text alignment in medical VLMs. It directly targets the need for spatially resolved, interpretable outputs in volumetric radiology, which is a recognized practical barrier. The framing of the clinical tasks—multi-disease diagnosis, zero-shot localization, downstream adaptation, and grounded report generation—is clear and relevant. The multi-center, multinational evaluation setup is also a plus if the details are solid. The soft spots sit mostly in the evidence. The abstract asserts broad outperformance over existing 3D medical foundation models but gives zero quantitative numbers, no error bars, no statistical tests, no ablation results, and no dataset split information. That leaves the central claims uncheckable from what is shown. The weak-supervision assumption—that report text plus coarse anatomy cues can produce reliable, transferable voxel-level anomaly scores—remains the riskiest part, especially for localization and cross-distribution transfer. It could easily latch onto report phrasing or organ context instead of true pathology extent. The stress-test note flags exactly this, and nothing in the provided abstract contradicts it. This paper is for groups working on medical foundation models or explainable volumetric AI. Readers hunting for new pre-training recipes in chest CT could extract useful ideas even if the validation needs strengthening. It deserves a serious referee because the topic is timely and the method is a coherent extension of prior work, though any review will need to focus on the missing quantitative support and the generalization tests. I would send it out for review once the full results are in place.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EXACT, an explainable anomaly-aware vision foundation model for 3D chest CT pre-trained on 25,692 CT-report pairs. It uses anatomy-aware weak supervision to jointly learn organ segmentation and multi-instance anomaly localization without manual voxel-level annotations, producing organ-specific voxel-level disease anomaly maps. Retrospective multinational multi-center evaluations claim consistent outperformance over existing 3D medical foundation models on multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation.

Significance. If the empirical gains are rigorously demonstrated, the work could advance scalable, interpretable volumetric medical AI by deriving spatially resolved representations directly from routine clinical scans and free-text reports. The joint segmentation-anomaly objective and avoidance of manual voxel labels are notable strengths that address limitations of global image-text models. The approach has clear potential for trustworthy clinical tools if localization accuracy and cross-center transfer are validated.

major comments (2)

[Abstract] Abstract: The central claim of 'broad and consistent improvements' and 'outperforming existing three-dimensional medical foundation models' across diagnosis, localization, adaptation, and report generation is asserted without any quantitative metrics, statistical tests, error bars, dataset splits, ablation results, or specific performance numbers. This absence is load-bearing because the entire significance rests on these empirical comparisons.
[Methods] Methods (anatomy-aware weak supervision description): The claim that voxel-level anomaly scores confined to organ anatomy can be reliably learned from report text alone (without manual voxel annotations) requires explicit validation that the scores reflect image-derived pathology extent rather than textual patterns or coarse organ context. This is critical for the zero-shot localization and multi-center generalization components of the main claim.

minor comments (2)

[Experiments] The manuscript would benefit from a dedicated ablation table isolating the contribution of the anatomy-aware component versus standard weak supervision.
Notation for the anomaly score computation and its integration with organ segmentation masks should be formalized with equations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment in detail below, providing clarifications based on the content of the full paper and indicating revisions where they strengthen the presentation without altering the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of 'broad and consistent improvements' and 'outperforming existing three-dimensional medical foundation models' across diagnosis, localization, adaptation, and report generation is asserted without any quantitative metrics, statistical tests, error bars, dataset splits, ablation results, or specific performance numbers. This absence is load-bearing because the entire significance rests on these empirical comparisons.

Authors: We acknowledge that the abstract, as currently written, summarizes the claims at a high level without embedding specific numerical results. The full manuscript contains extensive quantitative evaluations, including specific metrics, statistical tests, error bars, dataset splits, ablation studies, and performance numbers across all mentioned tasks, reported in the Results section with multi-center and multi-national data. To directly address this point and make the abstract self-contained for readers, we have revised the abstract to incorporate key quantitative highlights and statistical comparisons from our evaluations. revision: yes
Referee: [Methods] Methods (anatomy-aware weak supervision description): The claim that voxel-level anomaly scores confined to organ anatomy can be reliably learned from report text alone (without manual voxel annotations) requires explicit validation that the scores reflect image-derived pathology extent rather than textual patterns or coarse organ context. This is critical for the zero-shot localization and multi-center generalization components of the main claim.

Authors: We agree on the importance of clarifying the source of the anomaly scores. The scores are produced by the 3D vision encoder operating directly on the CT volume, with the anatomy-aware branch providing spatial constraints during joint training; the report text supplies only weak supervision signals rather than dictating the voxel values. The manuscript validates the image-derived nature through zero-shot anomaly localization experiments on held-out multi-center test sets, where the maps are assessed against expert-annotated pathological regions and outperform prior models, as well as through cross-center transfer results that demonstrate robustness beyond training-report distributions. These empirical outcomes on image-based tasks serve as the explicit validation that the representations capture pathology extent in the scans rather than textual artifacts alone. No further revision is needed on this point. revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical model with no derivational reductions

full rationale

The paper describes a vision-language foundation model (EXACT) pre-trained on 25,692 CT-report pairs via anatomy-aware weak supervision for organ segmentation and anomaly localization, then evaluated empirically on diagnosis, localization, adaptation, and report generation tasks. No equations, derivations, or first-principles claims are presented that reduce to fitted parameters, self-definitions, or self-citations by construction. All performance claims rest on retrospective multi-center experimental comparisons rather than any load-bearing mathematical chain that collapses to its inputs. This is the standard non-circular outcome for an applied ML paper whose contributions are architectural and empirical.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard deep-learning training assumptions plus the domain assumption that radiology reports contain sufficient spatial signal for voxel-level learning.

axioms (1)

domain assumption Radiology reports provide reliable weak labels for organ-level anomaly localization
Invoked in the pre-training description to justify learning without voxel annotations.

pith-pipeline@v0.9.0 · 5578 in / 1237 out tokens · 40751 ms · 2026-05-08T04:35:35.101354+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 1 internal anchor

[1]

& Tian, Q

Bai, X., Liu, M., Chen, Y ., Yang, H. & Tian, Q. Chest-OMDL: Organ-specific Multidisease Detection and Localization in Chest Computed Tomography using Weakly Supervised Deep Learning from Free-text Radiology Report. in Medical Imaging with Deep Learning (2025). 18. Liu, M., Jiao, Y ., Lu, J. & Chen, H. Anomaly Detection for Medical Images Using Teacher-St...

work page doi:10.1109/tim.2024.3406792 2025
[2]

Zhang, S. et al. A Multimodal Biomedical Foundation Model Trained from Fifteen Million Image–Text Pairs. NEJM AI 2, AIoa2400640 (2025). 32. Hamamci, I. E. et al. Generalist foundation models from a multimodal dataset for 3D computed tomography. Nat. Biomed. Eng https://doi.org/10.1038/s41551-025-01599-y (2026) doi:10.1038/s41551-025-01599-y. 33. Shui, Z. ...

work page doi:10.1038/s41551-025-01599-y 2025
[3]

& Zhu, L

Xing, Z., Ye, T., Yang, Y ., Liu, G. & Zhu, L. SegMamba: Long-Range Sequential Modeling Mamba for 3D Medical Image Segmentation. in Medical Image Computing and Computer Assisted Intervention – MICCAI 2024 (eds Linguraru, M. G. et al.) 578–588 (Springer Nature Switzerland, Cham, 2024). 47. Jiang, J. et al. RWKV-UNet: Improving UNet with Long-Range Cooperat...

work page doi:10.48550/arxiv.2501.08458 2024
[4]

Ma, L. et al. A vision–language pretrained transformer for versatile clinical respiratory disease applications. Nat. Biomed. Eng https://doi.org/10.1038/s41551-025-01544-z (2025) doi:10.1038/s41551-025-01544-z. 60. Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. in International Conference on Learning Rep...

work page doi:10.1038/s41551-025-01544-z 2025
[5]

GPT-4 Technical Report

OpenAI et al. GPT-4 Technical Report. arXiv e-prints arXiv:2303.08774 (2023) doi:10.48550/arXiv.2303.08774. 74. Liu, A. et al. Automatic intracranial abnormality detection and localization in head CT scans by learning from free-text reports. Cell Reports Medicine 4, 101164 (2023). 75. Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization. in I...

work page internal anchor Pith review doi:10.48550/arxiv.2303.08774 2023