Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review
Pith reviewed 2026-05-21 15:26 UTC · model grok-4.3
The pith
Uncertainty-calibrated and explainable fetal ultrasound AI is technically feasible and regulatorily expected.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reviews 78 studies published from 2015 to 2026 that pair automated fetal plane classification with explainability or uncertainty quantification. It finds a pooled balanced accuracy of 0.93, but only 19 studies reported calibration and only 14 reported selective prediction. To close that gap it proposes CALIB-XFUS, a 22-item reporting framework covering clinical task, dataset, model pipeline, calibration, explanation, and surveillance, intended to meet regulatory standards.
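For readers unfamiliar with the headline metric, the sketch below shows how balanced accuracy is computed for a single classifier: it is the mean of per-class recall, so a rarely seen plane counts as much as a common one. The plane names and labels are hypothetical illustrations, not data from the review, and the sketch does not reproduce the meta-analytic pooling across studies.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels for three planes (assumption, for illustration only).
y_true = ["brain", "brain", "brain", "brain", "femur", "abdomen"]
y_pred = ["brain", "brain", "brain", "brain", "femur", "brain"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("plain accuracy:", plain_accuracy)
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

In this toy case the single missed abdomen image barely moves plain accuracy but pulls balanced accuracy down to two thirds, which is why the metric suits imbalanced plane datasets.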
What carries the argument
The CALIB-XFUS 22-item reporting framework operationalizing calibration, explanation faithfulness, and fairness across six domains for regulated fetal ultrasound AI.
If this is right
- Models would need to include uncertainty quantification and selective prediction to support safe clinical decisions.
- Explanations must be validated by clinicians for faithfulness.
- Post-market surveillance would track performance in real-world settings.
- Fairness audits would ensure equitable performance across populations.
- Compliance with FDA Good Machine Learning Practice and EU AI Act would be facilitated.
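The first implication above, selective prediction, can be sketched as simple confidence thresholding: the model abstains (defers to the sonographer) whenever its confidence falls below a cut-off, and one reports coverage alongside accuracy on the retained cases. This is a minimal illustration under assumed inputs; the function name, thresholds, and confidence values are hypothetical and not taken from the paper.

```python
def selective_prediction(confidences, correct, threshold):
    """Return (coverage, selective_accuracy) when abstaining below threshold."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    # Accuracy is undefined when the model abstains on every case.
    selective_accuracy = sum(kept) / len(kept) if kept else None
    return coverage, selective_accuracy


# Hypothetical per-image confidences and correctness flags (assumption).
confidences = [0.99, 0.95, 0.90, 0.70, 0.60, 0.55]
correct = [True, True, True, False, True, False]

for threshold in (0.0, 0.65, 0.85):
    coverage, accuracy = selective_prediction(confidences, correct, threshold)
    print(f"threshold={threshold}: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

Raising the threshold trades coverage for accuracy on the retained cases; the trade-off is only meaningful if the confidences are reasonably calibrated in the first place.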
Where Pith is reading between the lines
- This approach could extend to other ultrasound or medical imaging classification tasks beyond fetal planes.
- Adoption of the framework might accelerate regulatory approval processes for similar AI tools.
- Researchers could test the framework by applying it retrospectively to existing studies to measure improvements in reporting quality.
Load-bearing premise
The 78 studies provide a representative sample of the field and the gaps in calibration and selective prediction reporting are the main barriers to safe clinical deployment.
What would settle it
A prospective clinical trial showing that fetal ultrasound AI without calibration or explanations produces higher plane-identification error rates than calibrated, explained versions would support the necessity of these components; the opposite result would challenge it.
Original abstract
Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.
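The abstract notes that few studies reported calibration. One common way to quantify it is the expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and observed accuracy in each bin. The sketch below is an equal-width-bin version with hypothetical inputs; the review does not state which calibration metric each study used, so treat this as one illustrative choice.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: 95% confidence, 50% accuracy (assumption).
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
print("ECE:", overconfident)
```

A well-calibrated model would score near zero; here the model claims 95% confidence while being right half the time, so the ECE is large.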
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a PRISMA 2020 systematic review of 78 studies (2015–2026) on deep learning for fetal ultrasound plane classification that incorporate explainability or uncertainty quantification. It reports a pooled balanced accuracy of 0.93 (95% CI 0.91–0.95) across six standard planes, notes that only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction, and proposes the CALIB-XFUS 22-item reporting framework spanning clinical task, dataset provenance, model pipeline, calibration/selective prediction, explanation faithfulness, and post-market surveillance to support FDA and EU AI Act compliance.
Significance. If the synthesis is representative and the proposed framework gains adoption, the work could help standardize reporting and accelerate safe clinical translation of AI for fetal ultrasound. The pooled accuracy metric offers a useful field benchmark, and the explicit identification of gaps in calibration and selective prediction reporting is a constructive contribution. The feasibility and regulatory-expectation claims, however, would be more robust with direct evidence on integrated implementations.
major comments (1)
- [Abstract] Abstract and Discussion: The claim that 'uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible' is load-bearing for the paper's central argument yet rests on the untested assumption that the separate components (calibration in 19/78 studies, selective prediction in 14/78 studies, plus explanation and fairness) can be combined without performance loss or new failure modes. No breakdown is provided of how many studies simultaneously address calibration, explanation faithfulness validation, and fairness auditing while preserving the reported accuracy levels.
minor comments (1)
- [Abstract] The search end date of April 30, 2026 appears to be a typographical error or projection; please confirm the actual date used for the literature search.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address the major comment below, agreeing where clarification is needed and indicating the revisions made to strengthen the manuscript without overstating the evidence.
- Referee: [Abstract] Abstract and Discussion: The claim that 'uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible' is load-bearing for the paper's central argument yet rests on the untested assumption that the separate components (calibration in 19/78 studies, selective prediction in 14/78 studies, plus explanation and fairness) can be combined without performance loss or new failure modes. No breakdown is provided of how many studies simultaneously address calibration, explanation faithfulness validation, and fairness auditing while preserving the reported accuracy levels.
  Authors: We agree that the original wording could be read as implying seamless integration across all components in existing work, and that a more explicit accounting of overlap would strengthen the argument. Our data extraction focused on individual reporting practices rather than exhaustive cross-tabulation of every possible combination of calibration, selective prediction, explanation faithfulness validation, and fairness auditing. Consequently, we did not quantify the precise number of studies addressing all elements simultaneously while maintaining the reported accuracy. However, the low individual counts (19/78 for calibration; 14/78 for selective prediction) already indicate limited overlap, which is why we developed the CALIB-XFUS framework. Many of the techniques are modular and post-hoc (e.g., temperature scaling for calibration, confidence thresholding for selective prediction, and gradient-based methods for explanation), supporting technical feasibility without requiring joint retraining. We have revised the abstract and discussion to qualify the claim as: the individual components have been demonstrated separately and can be combined using the proposed reporting framework. We have also added a short paragraph in the Discussion explicitly noting the scarcity of fully integrated implementations in the current literature and positioning CALIB-XFUS as a tool to enable such work.
  Revision: yes
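The rebuttal describes temperature scaling as a modular, post-hoc step. A minimal sketch of why that holds: a single scalar T is fitted on held-out logits to minimise negative log-likelihood, and dividing logits by T changes confidence without changing the predicted class. The grid search below is a simplification chosen for readability (gradient-based optimisers are the more usual choice), and the logits, labels, and candidate grid are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    losses = [-math.log(softmax(row, temperature)[label])
              for row, label in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest held-out NLL."""
    return min(candidates,
               key=lambda t: negative_log_likelihood(logit_rows, labels, t))


# Hypothetical held-out logits from an overconfident 3-class model (assumption).
val_logits = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [8.0, 0.0, 0.0], [0.0, 0.0, 8.0]]
val_labels = [0, 1, 1, 2]  # the third example is confidently wrong
candidates = [0.5 + 0.25 * k for k in range(39)]  # grid from 0.5 to 10.0

temperature = fit_temperature(val_logits, val_labels, candidates)
print("fitted temperature:", temperature)
print("top confidence before:", max(softmax(val_logits[0])))
print("top confidence after:", max(softmax(val_logits[0], temperature)))
```

Because the fitted temperature exceeds 1 here, the softmax is flattened and the top-class confidence drops toward the observed 75% accuracy, while the argmax of every row is unchanged. Whether calibration survives being combined with explanation and fairness auditing is exactly the open question the referee raises; this sketch does not address it.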
Circularity Check
No significant circularity in literature synthesis or framework proposal
The paper is a PRISMA-guided systematic review synthesizing findings from 78 external studies on fetal ultrasound plane classification, with pooled balanced accuracy and gap reporting (e.g., only 24% reporting calibration) drawn directly from the reviewed literature rather than any internal equations or self-referential constructions. The proposed CALIB-XFUS 22-item framework is presented as an original operationalization spanning six domains, without reduction to fitted inputs, self-citations, or ansatzes from prior author work. The feasibility and regulatory expectation argument extrapolates from the reviewed studies' existence and partial implementations but does not reduce by construction to the paper's own inputs; no load-bearing self-citation chains, uniqueness theorems, or renamings of known results are present. This constitutes a self-contained analysis against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: PRISMA 2020 guidelines for systematic reviews and meta-analyses
invented entities (1)
- CALIB-XFUS reporting framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We recommend implementing at least one epistemic uncertainty estimator... Temperature scaling... Conformal prediction for set-valued outputs... Grad-CAM++ produces class-specific attribution maps
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Pooled balanced accuracy... only 19 studies (24%) reported calibration
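The first quoted passage mentions conformal prediction for set-valued outputs. As a hedged illustration of that idea, the sketch below implements split conformal prediction with the simple score "1 minus the probability of the true class": a threshold is taken from a held-out calibration set, and each new image receives the set of planes whose score falls within it. Under exchangeability of calibration and test data this gives marginal coverage of at least 1 - alpha. The scores, alpha, and probabilities are hypothetical, and the paper may recommend a different score function.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: include every class
    return sorted(calibration_scores)[rank - 1]


def prediction_set(probabilities, threshold):
    """Indices of classes whose score (1 - probability) is within the threshold."""
    return [k for k, p in enumerate(probabilities) if 1.0 - p <= threshold]


# Hypothetical calibration scores: 1 - softmax probability of the true plane.
calibration_scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.65, 0.90]
threshold = conformal_quantile(calibration_scores, alpha=0.2)

confident_case = prediction_set([0.90, 0.07, 0.03], threshold)
ambiguous_case = prediction_set([0.50, 0.45, 0.05], threshold)
print("threshold:", threshold)
print("confident image ->", confident_case)
print("ambiguous image ->", ambiguous_case)
```

A single-plane set signals a confident call, while a two-plane set flags an image that may warrant review; set size thus doubles as an uncertainty signal.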
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.