Ranking XAI Methods for Head and Neck Cancer Outcome Prediction
Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3
The pith
A systematic ranking of 13 XAI methods across 24 metrics identifies Integrated Gradients and DeepLIFT as top performers for explaining head and neck cancer outcome predictions from PET/CT images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a comprehensive evaluation of thirteen XAI methods using twenty-four metrics on the HECKTOR multi-center dataset reveals large performance variations, with Integrated Gradients and DeepLIFT achieving high rankings for faithfulness, complexity, and plausibility when interpreting AI models for head and neck cancer outcome prediction.
What carries the argument
The ranking framework of 24 metrics grouped into faithfulness, robustness, complexity, and plausibility, applied to explanations from 13 XAI methods on PET/CT-based models for HNC prognosis.
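A minimal sketch of how a multi-metric ranking of this kind can be aggregated, assuming each metric yields one score per method. This is not the paper's implementation, which is not specified here; the metric names, the `Saliency` method, and every score below are invented for illustration.

```python
from statistics import mean, median, pstdev


def rank_methods(scores, higher_is_better=True):
    """Return {method: rank} with 1 = best; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg_rank = (i + j + 2) / 2  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    return ranks


def summarize_ranks(metric_table, direction):
    """Collect each method's rank on every metric, then summarize the ranks."""
    per_method = {}
    for metric, scores in metric_table.items():
        for method, rank in rank_methods(scores, direction[metric]).items():
            per_method.setdefault(method, []).append(rank)
    return {
        method: {"mean": mean(r), "median": median(r), "std": pstdev(r)}
        for method, r in per_method.items()
    }


# Toy values only (hypothetical metrics and scores, not results from the paper).
toy = {
    "faithfulness_a": {"IG": 0.8, "DeepLIFT": 0.7, "Saliency": 0.4},
    "complexity_a": {"IG": 0.2, "DeepLIFT": 0.3, "Saliency": 0.3},
}
direction = {"faithfulness_a": True, "complexity_a": False}  # False: lower is better
summary = summarize_ranks(toy, direction)
print(summary["IG"])  # {'mean': 1.0, 'median': 1.0, 'std': 0.0}
```

A large `std` for a method signals that its rank depends heavily on which metric is consulted, which is the kind of variation the paper reports.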
Load-bearing premise
That the 24 chosen metrics together capture the qualities that make an explanation useful and trustworthy for real clinical decisions in head and neck cancer.
What would settle it
A replication on an independent multi-center dataset where Integrated Gradients and DeepLIFT no longer rank at the top for faithfulness and plausibility, or where the relative ordering of all 13 methods changes substantially.
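One way to quantify whether the ordering "changes substantially" between two datasets is a rank correlation between the two rankings. The sketch below computes Kendall's tau by counting concordant and discordant method pairs, assuming no tied ranks. The method names `A` and `B` and both rankings are hypothetical, and the threshold for "substantial" is left to the reader.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same methods (no ties assumed)."""
    methods = sorted(rank_a)
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings: the replication swaps only the top two methods.
original = {"IG": 1, "DeepLIFT": 2, "A": 3, "B": 4}
replication = {"IG": 2, "DeepLIFT": 1, "A": 3, "B": 4}
tau = kendall_tau(original, replication)  # 5 concordant, 1 discordant of 6 pairs
```

A value near 1 indicates a stable ordering; values near 0 or below would support the "changes substantially" outcome described above.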
Original abstract
For head and neck cancer (HNC) patients, prognostic outcome prediction can support personalized treatment strategy selection. Improving prediction performance of HNC outcomes has been extensively explored by using advanced artificial intelligence (AI) techniques on PET/CT data. However, the interpretability of AI remains a critical obstacle for its clinical adoption. Unlike previous HNC studies that empirically selected explainable AI (XAI) techniques, we are the first to comprehensively evaluate and rank 13 XAI methods across 24 metrics, covering faithfulness, robustness, complexity and plausibility. Experimental results on the multi-center HECKTOR challenge dataset show large variations across evaluation aspects among different XAI methods, with Integrated Gradients (IG) and DeepLIFT (DL) consistently obtained high rankings for faithfulness, complexity and plausibility. This work highlights the importance of comprehensive XAI method evaluation and can be extended to other medical imaging tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to be the first comprehensive benchmarking of 13 XAI methods using 24 metrics spanning faithfulness, robustness, complexity, and plausibility for head and neck cancer outcome prediction from PET/CT images on the multi-center HECKTOR dataset. It reports substantial performance variations across methods and that Integrated Gradients and DeepLIFT consistently rank highest in faithfulness, complexity, and plausibility.
Significance. If the rankings prove robust, the work supplies empirical guidance for XAI selection in medical imaging and demonstrates the value of multi-aspect evaluation over ad-hoc choices. This could support more interpretable AI models for HNC prognosis, though its impact depends on whether the proxy metrics align with clinical decision-making needs.
major comments (2)
- The central ranking result depends on the 24 metrics serving as valid proxies for clinical utility in outcome prediction. Faithfulness metrics such as insertion/deletion evaluate pixel-level sensitivity but do not test whether highlighted regions correspond to biologically relevant features (e.g., hypoxic subvolumes or nodal involvement) that drive HNC prognosis. Plausibility is assessed via overlap with segmentation masks rather than clinician ratings of explanatory value for treatment decisions. This disconnect is load-bearing for interpreting the IG/DL rankings as preferable for real-world deployment.
- The abstract and results sections state rankings and 'large variations' without specifying model architectures for the base predictor, exact implementations of the 24 metrics, statistical significance tests, error bars, or preprocessing details. These omissions prevent verification that post-hoc choices did not influence the reported superiority of IG and DL.
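To make the first major comment concrete, here is a generic sketch of the two kinds of proxy it refers to: a deletion curve (faithfulness) and a top-k overlap with a segmentation mask (plausibility). The paper's exact metric definitions are not given in this review, so the function names, the flattened-voxel representation, and the toy linear model are all assumptions.

```python
def deletion_curve(model, x, attribution, baseline=0.0):
    """Model output as voxels are replaced by `baseline`, most important first."""
    order = sorted(range(len(x)), key=lambda i: abs(attribution[i]), reverse=True)
    perturbed = list(x)
    outputs = [model(perturbed)]
    for i in order:
        perturbed[i] = baseline
        outputs.append(model(perturbed))
    return outputs


def topk_mask_overlap(attribution, mask, k):
    """Fraction of the k highest-|attribution| voxels that lie inside the mask."""
    order = sorted(range(len(attribution)), key=lambda i: abs(attribution[i]), reverse=True)
    return sum(mask[i] for i in order[:k]) / k


# Toy setup (hypothetical): a linear "model" over four flattened voxels.
weights = [0.5, 0.0, 2.0, 1.0]


def toy_model(voxels):
    return sum(w * v for w, v in zip(weights, voxels))


x = [1.0, 1.0, 1.0, 1.0]
attribution = [0.5, 0.0, 2.0, 1.0]  # gradient * input for this linear model
mask = [0, 0, 1, 1]                 # 1 = voxel inside the tumor mask

curve = deletion_curve(toy_model, x, attribution)   # [3.5, 1.5, 0.5, 0.0, 0.0]
overlap = topk_mask_overlap(attribution, mask, k=2)  # 1.0
```

Both quantities depend only on the model output and the mask. Neither checks whether the highlighted voxels are biologically meaningful, which is the gap the comment identifies.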
minor comments (2)
- Abstract: 'consistently obtained high rankings' contains a tense inconsistency; rephrase to 'consistently obtain high rankings' or similar for grammatical accuracy.
- The manuscript would benefit from an expanded limitations paragraph explicitly addressing the gap between proxy metrics and prospective clinical validation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment point-by-point below. Where the comments identify gaps in detail or discussion, we have revised the manuscript accordingly.
- Referee: The central ranking result depends on the 24 metrics serving as valid proxies for clinical utility in outcome prediction. Faithfulness metrics such as insertion/deletion evaluate pixel-level sensitivity but do not test whether highlighted regions correspond to biologically relevant features (e.g., hypoxic subvolumes or nodal involvement) that drive HNC prognosis. Plausibility is assessed via overlap with segmentation masks rather than clinician ratings of explanatory value for treatment decisions. This disconnect is load-bearing for interpreting the IG/DL rankings as preferable for real-world deployment.
  Authors: We agree that the 24 metrics are established quantitative proxies rather than direct measures of biological relevance or clinical decision utility. Our benchmarking follows standard XAI evaluation protocols from the literature to enable objective, reproducible comparisons across methods. In the revised manuscript we have added a dedicated Limitations subsection in the Discussion that explicitly acknowledges this gap, notes that segmentation-overlap plausibility is a common but imperfect proxy, and states that future work should incorporate clinician ratings and biological validation (e.g., hypoxic subvolume correlation). The core empirical rankings remain unchanged because they are correctly reported as metric-specific results.
  Revision: yes
- Referee: The abstract and results sections state rankings and 'large variations' without specifying model architectures for the base predictor, exact implementations of the 24 metrics, statistical significance tests, error bars, or preprocessing details. These omissions prevent verification that post-hoc choices did not influence the reported superiority of IG and DL.
  Authors: We acknowledge that the original submission omitted several implementation details required for full reproducibility. In the revised manuscript we have: (1) expanded the Methods section with the precise base predictor architecture (3D ResNet-50 with specific hyperparameters), (2) provided references and pseudocode for each of the 24 metrics, (3) added statistical significance testing (paired Wilcoxon tests with p-values and effect sizes) between top-ranked methods, (4) included error bars on all ranking plots, and (5) detailed the full preprocessing pipeline (resampling, normalization, augmentation). These additions allow independent verification and address the concern that post-hoc choices may have influenced the IG/DL rankings.
  Revision: yes
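For readers unfamiliar with the paired Wilcoxon test mentioned in the response, the sketch below computes the signed-rank statistic and an exact two-sided p-value by enumerating all sign assignments. It is a standard-library illustration for small samples only (the enumeration grows as 2^n) and is not the authors' analysis; the paired scores are invented.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples (small n)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Tied absolute differences share the average rank.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j + 2) / 2
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= stat:
            extreme += 1
    return stat, extreme / 2 ** len(ranks)


# Hypothetical paired scores of two methods on six cases.
method_a = [0.90, 0.80, 0.70, 0.85, 0.60, 0.75]
method_b = [0.80, 0.60, 0.40, 0.45, 0.10, 0.15]
stat, p_value = wilcoxon_signed_rank(method_a, method_b)  # stat 0, p = 2/64
```

With six pairs, the smallest attainable two-sided p-value is 2/64 = 0.03125, so the number of paired observations limits how strong the reported significance can be.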
Circularity Check
Empirical benchmarking with no derivation chain or self-referential reductions
The paper performs a direct empirical comparison of 13 standard XAI methods on the public multi-center HECKTOR dataset, computing 24 pre-existing metrics for faithfulness, robustness, complexity and plausibility. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided abstract or described methodology. Results are obtained by applying off-the-shelf XAI techniques and reporting metric values; the ranking therefore does not reduce to any input by construction and remains externally falsifiable on the same dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 24 metrics chosen adequately represent the clinical value of XAI explanations for cancer prognosis models.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Head and neck cancer (HNC) is the seventh most common cancer worldwide [1], treated primarily with radiotherapy with or without chemotherapy and surgery. Despite similar treatments, substantial variability in outcomes remains among patients. This motivates the development of predictive models to guide personalized treatment. Recent studies...
- [2] MATERIALS AND METHODS. 2.1. Dataset. The latest HECKTOR 2025 training dataset (https://hecktor25.grand-challenge.org/dataset/) was used to develop HNC outcome prediction models. Data from 651 patients, each with CT, PET and Gross Tumor Volume (GTV) mask (Fig. 1.1) of primary tumor and lymph nodes were included. The data was randomly split in a train set of ...
- [3] RESULTS. The DenseNet121 achieved a C-index of 0.66 in the multi-center test set, which is comparable with results in previous studies [4], [7]. Tab. 1 summarizes the mean, median, and standard deviation (std) of the rankings of all XAI methods across the four evaluation aspects in the test set. In general, the ranking variances across methods are rea...
- [4] DISCUSSION. This study presented a comprehensive evaluation of 13 post-hoc XAI methods using 20 metrics for HNC outcome prediction task. The large standard deviations of rankings in Tab. 1 reveal substantial variations among XAI methods across metrics, which aligns with the observations from LATEC benchmark [8]. This highlights the importance of select...
- [5] CONCLUSION. In summary, this study provides a comprehensive evaluation of XAI methods for HNC outcome prediction across four aspects: faithfulness, robustness, complexity, and clinical plausibility. Integrated Gradients and DeepLIFT produced the most faithful and plausible explanations. The results underscore the need for task-specific evaluation and ad...
- [6] COMPLIANCE WITH ETHICAL STANDARDS. This research study was conducted retrospectively using human subject data made available in open access by HECKTOR 2025. Ethical approval was not required as confirmed by the license attached with the open access data.
- [7] ACKNOWLEDGMENTS. No funding was received for conducting this study. The authors have no relevant financial or non-financial interests to disclose. We acknowledge the idea discussion provided by Dr. Kennth Gilhuijs.
- [8] H. Sung et al., “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,” CA Cancer J Clin, vol. 71, no. 3, pp. 209–249, May 2021, doi: 10.3322/CAAC.21660
- [9] V. Andrearczyk, “Overview of the HECKTOR Challenge at MICCAI 2022: Automatic Head and Neck Tumor Segmentation and Outcome Prediction in PET/CT,” Head and Neck Tumor Segmentation and Outcome Prediction, 2023
- [10] V. Andrearczyk et al., “Overview of the HECKTOR challenge at MICCAI 2021: automatic head and neck tumor segmentation and outcome prediction in PET/CT images,” in 3D Head and Neck Tumor Segmentation in PET/CT Challenge, Springer, 2021, pp. 1–37
- [11] B. Ma, J. Guo, L. Van Dijk, P. M. A. van Ooijen, S. Both, and N. M. Sijtsema, “TransRP: Transformer-based PET/CT feature extraction incorporating clinical data for recurrence-free survival prediction in oropharyngeal cancer,” in Medical Imaging with Deep Learning, 2023
- [12] B. Ma et al., “PET/CT based transformer model for multi-outcome prediction in oropharyngeal cancer,” Radiotherapy and Oncology, p. 110368, 2024
- [13] M. Meng, L. Bi, M. Fulham, D. Feng, and J. Kim, “Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer,” arXiv preprint arXiv:2307.03427, 2023
- [14] B. Ma et al., “PET and CT based DenseNet outperforms advanced deep learning models for outcome prediction of oropharyngeal cancer,” Radiotherapy and Oncology, vol. 207, p. 110852, 2025
- [15] L. Klein, C. Lüth, U. Schlegel, T. Bungert, M. El-Assady, and P. Jäger, “Navigating the maze of explainable ai: A systematic approach to evaluating methods and metrics,” Adv Neural Inf Process Syst, vol. 37, pp. 67106–67146, 2024
- [16] J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger, “DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network,” BMC Med Res Methodol, vol. 18, no. 1, 2018, doi: 10.1186/s12874-018-0482-1
- [17] W. Jin, X. Li, M. Fatehi, and G. Hamarneh, “Guidelines and evaluation of clinical explainable AI in medical image analysis,” Med Image Anal, vol. 84, p. 102684, 2023