arxiv: 2601.00990 · v2 · pith:OV4JENARnew · submitted 2026-01-02 · 📡 eess.IV · cs.CV

Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review

Pith reviewed 2026-05-21 15:26 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords fetal ultrasound · plane classification · explainable AI · uncertainty calibration · systematic review · CALIB-XFUS · regulatory compliance · medical AI

The pith

Uncertainty-calibrated and explainable fetal ultrasound AI is technically feasible and regulatorily expected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fetal ultrasound plane classification is essential for antenatal care, and deep learning models now achieve high accuracy on standard planes. A systematic review of 78 studies finds a pooled balanced accuracy of 0.93, yet only 19 studies (24%) report calibration and 14 (18%) report selective prediction. The authors introduce CALIB-XFUS, a 22-item framework that standardizes reporting on calibration, explanation faithfulness, and fairness. They argue that such AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.

Core claim

The paper reviews 78 studies published from 2015 to 2026 that pair automated fetal plane classification with explainability or uncertainty quantification. Pooled balanced accuracy is 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) report calibration and 14 (18%) report selective prediction. To close that gap, the authors propose CALIB-XFUS, a 22-item reporting framework covering clinical task, dataset, model pipeline, calibration, explanation, and surveillance, intended to meet regulatory standards.
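
For readers unfamiliar with the headline metric, the sketch below computes balanced accuracy under its standard definition, the unweighted mean of per-class recall. The paper's pooling procedure is not described on this page, so this only illustrates the per-study quantity. The function name and the three-class confusion matrix are hypothetical, not data from the review.

```python
def balanced_accuracy(confusion):
    """Unweighted mean of per-class recall for a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # skip classes that have no true samples
        recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)


# Hypothetical three-plane example (illustrative counts only).
toy_confusion = [
    [90, 5, 5],    # true class 0
    [10, 80, 10],  # true class 1
    [0, 0, 10],    # true class 2, a rare class
]

plain_accuracy = sum(toy_confusion[i][i] for i in range(3)) / sum(map(sum, toy_confusion))
print(round(balanced_accuracy(toy_confusion), 3), round(plain_accuracy, 3))
```

Because every class contributes equally, a rare plane weighs as much as a common one, which is why balanced accuracy can differ from plain accuracy on imbalanced plane distributions.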

What carries the argument

The CALIB-XFUS 22-item reporting framework operationalizing calibration, explanation faithfulness, and fairness across six domains for regulated fetal ultrasound AI.
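
A domain-level coverage check for such a framework might look like the sketch below. The six domain names are taken from the paper's abstract. The 22 individual items are not listed on this page, so they are deliberately not modelled. The function `missing_domains` and the example report are hypothetical.

```python
# The six CALIB-XFUS domains as named in the paper's abstract.
CALIB_XFUS_DOMAINS = (
    "clinical task and indication for use",
    "dataset provenance and representativeness",
    "model and training pipeline",
    "calibration and selective prediction",
    "explanation faithfulness and clinician validation",
    "post-market surveillance",
)


def missing_domains(reported):
    """Return the domains a study report does not cover, in framework order.

    `reported` is a set of domain names. Unknown names raise ValueError so
    that typos are not silently counted as coverage.
    """
    unknown = set(reported) - set(CALIB_XFUS_DOMAINS)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    return [d for d in CALIB_XFUS_DOMAINS if d not in reported]


# Hypothetical study that reports only task, data, and model details.
hypothetical_report = set(CALIB_XFUS_DOMAINS[:3])
print(missing_domains(hypothetical_report))
```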

If this is right

  • Models would need to include uncertainty quantification and selective prediction to support safe clinical decisions.
  • Explanations must be validated by clinicians for faithfulness.
  • Post-market surveillance would track performance in real-world settings.
  • Fairness audits would ensure equitable performance across populations.
  • Compliance with FDA Good Machine Learning Practice and EU AI Act would be facilitated.
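
The first bullet can be made concrete with a minimal sketch of selective prediction by confidence thresholding: the classifier abstains when its top-class probability falls below a threshold, and the image is deferred to a clinician. The function name, threshold values, and toy predictions are assumptions for illustration. The reviewed studies may use other abstention rules.

```python
def selective_metrics(confidences, correct, threshold):
    """Coverage and selective accuracy when abstaining below `threshold`.

    confidences: top-class probability for each image
    correct:     whether each top-class prediction was right
    """
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    # Selective accuracy is undefined if the model abstains on everything.
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


# Hypothetical predictions (illustrative only, not data from the review).
conf = [0.99, 0.95, 0.90, 0.70, 0.60, 0.55]
ok = [True, True, True, False, True, False]

for t in (0.0, 0.8):
    print(t, selective_metrics(conf, ok, t))
```

Raising the threshold trades coverage for accuracy on the retained images. That trade-off is only trustworthy when the confidences are calibrated, which is why the two practices are usually reported together.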

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to other ultrasound or medical imaging classification tasks beyond fetal planes.
  • Adoption of the framework might accelerate regulatory approval processes for similar AI tools.
  • Researchers could test the framework by applying it retrospectively to existing studies to measure improvements in reporting quality.

Load-bearing premise

The 78 studies provide a representative sample of the field and the gaps in calibration and selective prediction reporting are the main barriers to safe clinical deployment.

What would settle it

A prospective clinical trial comparing calibrated, explained fetal ultrasound AI against a version without calibration or explanations would settle it. Higher plane-identification error rates without calibration or explanations would support the claim that they are necessary; no difference, or lower error rates, would challenge it.

Figures

Figures reproduced from arXiv: 2601.00990 by Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Özkan Günalp.

Figure 1: A minimal clinical workflow for uncertainty-calibrated explainable plane classification. Cali…
Original abstract

Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript is a PRISMA 2020 systematic review of 78 studies (2015–2026) on deep learning for fetal ultrasound plane classification that incorporate explainability or uncertainty quantification. It reports a pooled balanced accuracy of 0.93 (95% CI 0.91–0.95) across six standard planes, notes that only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction, and proposes the CALIB-XFUS 22-item reporting framework spanning clinical task, dataset provenance, model pipeline, calibration/selective prediction, explanation faithfulness, and post-market surveillance to support FDA and EU AI Act compliance.

Significance. If the synthesis is representative and the proposed framework gains adoption, the work could help standardize reporting and accelerate safe clinical translation of AI for fetal ultrasound. The pooled accuracy metric offers a useful field benchmark, and the explicit identification of gaps in calibration and selective prediction reporting is a constructive contribution. The feasibility and regulatory-expectation claims, however, would be more robust with direct evidence on integrated implementations.

major comments (1)
  1. [Abstract] Abstract and Discussion: The claim that 'uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible' is load-bearing for the paper's central argument yet rests on the untested assumption that the separate components (calibration in 19/78 studies, selective prediction in 14/78 studies, plus explanation and fairness) can be combined without performance loss or new failure modes. No breakdown is provided of how many studies simultaneously address calibration, explanation faithfulness validation, and fairness auditing while preserving the reported accuracy levels.
minor comments (1)
  1. [Abstract] The search end date of April 30, 2026 appears to be a typographical error or projection; please confirm the actual date used for the literature search.
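
Major comment 1 can be partly quantified from the reported counts alone. The sketch below checks the stated percentages and computes Fréchet bounds on how many studies could report both calibration and selective prediction. The counts come from the abstract. The function `overlap_bounds` is a hypothetical helper, and the bounds say nothing about explanation or fairness, for which no counts are given.

```python
def overlap_bounds(total, *counts):
    """Fréchet bounds on how many of `total` items can lie in every subset.

    counts: the size of each subset, e.g. studies reporting each practice.
    """
    upper = min(counts)
    lower = max(0, sum(counts) - (len(counts) - 1) * total)
    return lower, upper


TOTAL = 78        # studies included in the review
CALIBRATION = 19  # studies reporting calibration
SELECTIVE = 14    # studies reporting selective prediction

print(round(100 * CALIBRATION / TOTAL), round(100 * SELECTIVE / TOTAL))  # 24 18
print(overlap_bounds(TOTAL, CALIBRATION, SELECTIVE))  # (0, 14)
```

The number of studies reporting both practices therefore lies anywhere between 0 and 14 of 78, so the marginal counts cannot settle the integration question, and a cross-tabulation is needed.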

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address the major comment below, agreeing where clarification is needed and indicating the revisions made to strengthen the manuscript without overstating the evidence.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Discussion: The claim that 'uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible' is load-bearing for the paper's central argument yet rests on the untested assumption that the separate components (calibration in 19/78 studies, selective prediction in 14/78 studies, plus explanation and fairness) can be combined without performance loss or new failure modes. No breakdown is provided of how many studies simultaneously address calibration, explanation faithfulness validation, and fairness auditing while preserving the reported accuracy levels.

    Authors: We agree that the original wording could be read as implying seamless integration across all components in existing work, and that a more explicit accounting of overlap would strengthen the argument. Our data extraction focused on individual reporting practices rather than exhaustive cross-tabulation of every possible combination of calibration, selective prediction, explanation faithfulness validation, and fairness auditing. Consequently, we did not quantify the precise number of studies addressing all elements simultaneously while maintaining the reported accuracy. However, the low individual counts (19/78 for calibration; 14/78 for selective prediction) already indicate limited overlap, which is why we developed the CALIB-XFUS framework. Many of the techniques are modular and post-hoc (e.g., temperature scaling for calibration, confidence thresholding for selective prediction, and gradient-based methods for explanation), supporting technical feasibility without requiring joint retraining. We have revised the abstract and discussion to qualify the claim as: the individual components have been demonstrated separately and can be combined using the proposed reporting framework. We have also added a short paragraph in the Discussion explicitly noting the scarcity of fully integrated implementations in the current literature and positioning CALIB-XFUS as a tool to enable such work.

    Revision: yes
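
The modularity argument in the rebuttal can be illustrated with temperature scaling, which rescales a trained classifier's logits by a single scalar fitted on held-out data. The sketch below is a minimal pure-Python version. It uses a grid search over the temperature as a simple stand-in for the gradient-based optimisers usually used. All function names and the validation logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels at a temperature."""
    losses = [
        -math.log(softmax(logits, temperature)[y])
        for logits, y in zip(logits_list, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logits_list, labels, grid=None):
    """Pick the temperature that minimises held-out NLL (grid search)."""
    grid = grid or [0.25 + 0.05 * k for k in range(96)]  # 0.25 to 5.0
    return min(grid, key=lambda t: nll(logits_list, labels, t))


# Hypothetical validation logits from an overconfident 3-class model.
val_logits = [[4.0, 0.0, 0.0], [0.0, 5.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
val_labels = [0, 1, 1, 2]  # the third prediction is confidently wrong

T = fit_temperature(val_logits, val_labels)
print(round(T, 2), round(nll(val_logits, val_labels, 1.0), 3),
      round(nll(val_logits, val_labels, T), 3))
```

Dividing logits by a positive scalar never changes the predicted class, so accuracy is untouched while confidence is softened. This supports the claim that calibration can be added post hoc. It does not show that calibration, explanation, and fairness auditing compose without new failure modes, which is the referee's open point.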

Circularity Check

0 steps flagged

No significant circularity in literature synthesis or framework proposal

The paper is a PRISMA-guided systematic review synthesizing findings from 78 external studies on fetal ultrasound plane classification, with pooled balanced accuracy and gap reporting (e.g., only 24% reporting calibration) drawn directly from the reviewed literature rather than any internal equations or self-referential constructions. The proposed CALIB-XFUS 22-item framework is presented as an original operationalization spanning six domains, without reduction to fitted inputs, self-citations, or ansatzes from prior author work. The feasibility and regulatory expectation argument extrapolates from the reviewed studies' existence and partial implementations but does not reduce by construction to the paper's own inputs; no load-bearing self-citation chains, uniqueness theorems, or renamings of known results are present. This constitutes a self-contained analysis against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The review depends on PRISMA 2020 as the methodological standard and synthesizes quantitative results from the 78 included studies; the CALIB-XFUS framework is an original construct without independent prior validation.

axioms (1)
  • PRISMA 2020 guidelines for systematic reviews and meta-analyses (domain assumption)
    Invoked in the abstract to structure the review of 78 studies.
invented entities (1)
  • CALIB-XFUS reporting framework (no independent evidence)
    Purpose: operationalizes calibration, explanation faithfulness, and fairness for regulated fetal ultrasound AI across six domains.
    New 22-item checklist proposed by the authors without reference to prior independent evidence or validation studies.
