Ranking XAI Methods for Head and Neck Cancer Outcome Prediction

Baoqiang Ma; Djennifer K. Madzia-Madzou; Jin Ouyang; Rosa C.J. Kraaijveld

arxiv: 2604.16034 · v1 · submitted 2026-04-17 · 💻 cs.CV · physics.data-an

Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3

keywords explainable AI · XAI evaluation · head and neck cancer · outcome prediction · PET/CT · Integrated Gradients · DeepLIFT · HECKTOR dataset

The pith

A systematic ranking of 13 XAI methods across 24 metrics identifies Integrated Gradients and DeepLIFT as top performers for explaining head and neck cancer outcome predictions from PET/CT images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors seek to replace ad-hoc selection of explainable AI techniques in medical imaging with a full comparison. They evaluate thirteen XAI methods on deep learning models that predict head and neck cancer outcomes from PET and CT scans in the multi-center HECKTOR dataset. Each method is scored on twenty-four metrics that measure how faithfully the explanation reflects the model's logic, how stable it remains under small changes, how simple the output is, and how plausible it appears. The results show wide differences in scores across methods, with Integrated Gradients and DeepLIFT placing consistently high on faithfulness, complexity, and plausibility. This matters for clinical use because trustworthy explanations could help doctors understand AI recommendations when choosing personalized treatments.

Core claim

The paper establishes that a comprehensive evaluation of thirteen XAI methods using twenty-four metrics on the HECKTOR multi-center dataset reveals large performance variations, with Integrated Gradients and DeepLIFT achieving high rankings for faithfulness, complexity, and plausibility when interpreting AI models for head and neck cancer outcome prediction.
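For readers unfamiliar with Integrated Gradients, the following is a minimal sketch of the idea on a toy logistic model, not the paper's DenseNet pipeline. The model, its weights `w` and `b`, the all-zero baseline, and the helper names are illustrative assumptions. The attribution of each input feature is the input-minus-baseline difference multiplied by the gradient averaged along a straight path from the baseline to the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy differentiable model (assumption): a logistic unit over a flat input."""
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input."""
    p = np.asarray(model(x, w, b))
    return (p * (1.0 - p))[..., None] * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of IG along the path baseline -> x."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, n_features)
    avg_grad = model_grad(path, w, b).mean(axis=0)      # average gradient on the path
    return (x - baseline) * avg_grad

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # hypothetical weights
b = 0.1
x = rng.normal(size=8)          # hypothetical input
baseline = np.zeros_like(x)     # all-zero baseline (an assumption)

attr = integrated_gradients(x, baseline, w, b)
# Completeness: attributions should sum to model(x) - model(baseline).
gap = abs(attr.sum() - (model(x, w, b) - model(baseline, w, b)))
print(attr.round(4), gap)
```

The completeness check at the end is a useful sanity test in practice: a large gap usually means too few integration steps.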

What carries the argument

The ranking framework of 24 metrics grouped into faithfulness, robustness, complexity, and plausibility, applied to explanations from 13 XAI methods on PET/CT-based models for HNC prognosis.
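One way such a ranking framework could be organized is sketched below. The sketch assumes a score matrix with one row per XAI method and one column per metric, a flag saying whether higher is better for each metric, and an aspect label per metric. The scores, method count, and helper names are invented for illustration and are not the paper's numbers or code.

```python
import numpy as np

def rank_methods(scores, higher_is_better):
    """Rank methods (rows) within each metric (column); rank 1 is best.
    Ties are broken by row order, which is a simplification."""
    signed = np.where(higher_is_better, -scores, scores)  # smaller is better
    order = np.argsort(signed, axis=0, kind="stable")
    ranks = np.empty_like(order)
    cols = np.arange(scores.shape[1])
    ranks[order, cols] = np.arange(1, scores.shape[0] + 1)[:, None]
    return ranks

def summarize(ranks, aspect_of_metric):
    """Mean, median and std of each method's ranks within each aspect."""
    summary = {}
    for aspect in dict.fromkeys(aspect_of_metric):
        cols = [i for i, a in enumerate(aspect_of_metric) if a == aspect]
        r = ranks[:, cols]
        summary[aspect] = {
            "mean": r.mean(axis=1),
            "median": np.median(r, axis=1),
            "std": r.std(axis=1),
        }
    return summary

# Invented toy scores: 3 methods (rows) x 4 metrics (columns).
scores = np.array([
    [0.9, 0.2, 0.8, 0.3],   # method_A
    [0.8, 0.3, 0.7, 0.1],   # method_B
    [0.4, 0.6, 0.5, 0.5],   # method_C
])
higher_is_better = np.array([True, False, True, False])
aspects = ["faithfulness", "faithfulness", "plausibility", "complexity"]

ranks = rank_methods(scores, higher_is_better)
summary = summarize(ranks, aspects)
print(ranks)
print(summary["faithfulness"]["mean"])
```

Ranking within each metric before aggregating keeps metrics with different scales comparable, at the cost of discarding how large the score differences are.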

Load-bearing premise

That the 24 chosen metrics together capture the qualities that make an explanation useful and trustworthy for real clinical decisions in head and neck cancer.

What would settle it

A replication on an independent multi-center dataset where Integrated Gradients and DeepLIFT no longer rank at the top for faithfulness and plausibility, or where the relative ordering of all 13 methods changes substantially.

Original abstract

For head and neck cancer (HNC) patients, prognostic outcome prediction can support personalized treatment strategy selection. Improving prediction performance of HNC outcomes has been extensively explored by using advanced artificial intelligence (AI) techniques on PET/CT data. However, the interpretability of AI remains a critical obstacle for its clinical adoption. Unlike previous HNC studies that empirically selected explainable AI (XAI) techniques, we are the first to comprehensively evaluate and rank 13 XAI methods across 24 metrics, covering faithfulness, robustness, complexity and plausibility. Experimental results on the multi-center HECKTOR challenge dataset show large variations across evaluation aspects among different XAI methods, with Integrated Gradients (IG) and DeepLIFT (DL) consistently obtained high rankings for faithfulness, complexity and plausibility. This work highlights the importance of comprehensive XAI method evaluation and can be extended to other medical imaging tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward benchmarking study that ranks 13 XAI methods on HNC outcome prediction from PET/CT and finds IG and DL strongest overall, but the proxy metrics leave a gap to actual clinical utility.

The paper's core contribution is an empirical ranking of explanation techniques for a prognostic model on the public HECKTOR multi-center dataset. They test 13 methods against 24 metrics spanning faithfulness, robustness, complexity, and plausibility, and report that Integrated Gradients and DeepLIFT come out ahead on most of the key dimensions. This is new for the HNC setting, where earlier work tended to pick XAI tools without systematic comparison, and the scale of the evaluation plus the public data make the results usable as a reference point for similar tasks.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to be the first comprehensive benchmarking of 13 XAI methods using 24 metrics spanning faithfulness, robustness, complexity, and plausibility for head and neck cancer outcome prediction from PET/CT images on the multi-center HECKTOR dataset. It reports substantial performance variations across methods and that Integrated Gradients and DeepLIFT consistently rank highest in faithfulness, complexity, and plausibility.

Significance. If the rankings prove robust, the work supplies empirical guidance for XAI selection in medical imaging and demonstrates the value of multi-aspect evaluation over ad-hoc choices. This could support more interpretable AI models for HNC prognosis, though its impact depends on whether the proxy metrics align with clinical decision-making needs.

major comments (2)

The central ranking result depends on the 24 metrics serving as valid proxies for clinical utility in outcome prediction. Faithfulness metrics such as insertion/deletion evaluate pixel-level sensitivity but do not test whether highlighted regions correspond to biologically relevant features (e.g., hypoxic subvolumes or nodal involvement) that drive HNC prognosis. Plausibility is assessed via overlap with segmentation masks rather than clinician ratings of explanatory value for treatment decisions. This disconnect is load-bearing for interpreting the IG/DL rankings as preferable for real-world deployment.
The abstract and results sections state rankings and 'large variations' without specifying model architectures for the base predictor, exact implementations of the 24 metrics, statistical significance tests, error bars, or preprocessing details. These omissions prevent verification that post-hoc choices did not influence the reported superiority of IG and DL.
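To make the first major comment concrete, here is a rough sketch of the two kinds of proxy it refers to: a deletion-style faithfulness curve and a mask-overlap plausibility score. The paper's exact metric implementations are not given in this review, so the function names, the zero fill value, the top-k overlap definition, and the toy "model" are all assumptions.

```python
import numpy as np

def deletion_curve(predict, image, attribution, steps=10, fill=0.0):
    """Remove voxels from most to least attributed and record the model output."""
    order = np.argsort(-attribution.ravel(), kind="stable")
    work = image.astype(float).ravel().copy()
    outputs = [predict(work.reshape(image.shape))]
    for chunk in np.array_split(order, steps):
        work[chunk] = fill
        outputs.append(predict(work.reshape(image.shape)))
    return np.array(outputs)

def curve_auc(outputs):
    """Trapezoidal area on a [0, 1] removal axis; lower suggests a more faithful map."""
    x = np.linspace(0.0, 1.0, len(outputs))
    return float(np.sum((outputs[1:] + outputs[:-1]) / 2 * np.diff(x)))

def topk_mask_overlap(attribution, mask, k=None):
    """Fraction of the k most-attributed voxels that lie inside a binary mask."""
    flat_mask = mask.astype(bool).ravel()
    k = int(flat_mask.sum()) if k is None else k
    top = np.argsort(-attribution.ravel(), kind="stable")[:k]
    return float(flat_mask[top].mean())

rng = np.random.default_rng(0)
image = rng.random((4, 4, 4))                  # toy volume, not PET/CT data
mask = np.zeros(image.shape, dtype=bool)
mask[1:3, 1:3, 1:3] = True                     # hypothetical GTV-like region
predict = lambda vol: float(vol[mask].sum())   # toy model that only uses the mask

good = np.where(mask, image, 0.0)   # attribution on exactly what the toy model uses
bad = rng.random(image.shape)       # random attribution

auc_good = curve_auc(deletion_curve(predict, image, good))
auc_bad = curve_auc(deletion_curve(predict, image, bad))
print(auc_good, auc_bad, topk_mask_overlap(good, mask), topk_mask_overlap(bad, mask))
```

Both scores measure agreement with the model or with a contour. Neither asks whether the highlighted region is the biologically relevant one, which is the gap the referee points to.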

minor comments (2)

Abstract: 'consistently obtained high rankings' contains a tense inconsistency; rephrase to 'consistently obtain high rankings' or similar for grammatical accuracy.
The manuscript would benefit from an expanded limitations paragraph explicitly addressing the gap between proxy metrics and prospective clinical validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point-by-point below. Where the comments identify gaps in detail or discussion, we have revised the manuscript accordingly.

Referee: The central ranking result depends on the 24 metrics serving as valid proxies for clinical utility in outcome prediction. Faithfulness metrics such as insertion/deletion evaluate pixel-level sensitivity but do not test whether highlighted regions correspond to biologically relevant features (e.g., hypoxic subvolumes or nodal involvement) that drive HNC prognosis. Plausibility is assessed via overlap with segmentation masks rather than clinician ratings of explanatory value for treatment decisions. This disconnect is load-bearing for interpreting the IG/DL rankings as preferable for real-world deployment.

Authors: We agree that the 24 metrics are established quantitative proxies rather than direct measures of biological relevance or clinical decision utility. Our benchmarking follows standard XAI evaluation protocols from the literature to enable objective, reproducible comparisons across methods. In the revised manuscript we have added a dedicated Limitations subsection in the Discussion that explicitly acknowledges this gap, notes that segmentation-overlap plausibility is a common but imperfect proxy, and states that future work should incorporate clinician ratings and biological validation (e.g., hypoxic subvolume correlation). The core empirical rankings remain unchanged because they are correctly reported as metric-specific results. revision: yes
Referee: The abstract and results sections state rankings and 'large variations' without specifying model architectures for the base predictor, exact implementations of the 24 metrics, statistical significance tests, error bars, or preprocessing details. These omissions prevent verification that post-hoc choices did not influence the reported superiority of IG and DL.

Authors: We acknowledge that the original submission omitted several implementation details required for full reproducibility. In the revised manuscript we have: (1) expanded the Methods section with the precise base predictor architecture (DenseNet121, as reported in the Results, with its hyperparameters), (2) provided references and pseudocode for each of the 24 metrics, (3) added statistical significance testing (paired Wilcoxon tests with p-values and effect sizes) between top-ranked methods, (4) included error bars on all ranking plots, and (5) detailed the full preprocessing pipeline (resampling, normalization, augmentation). These additions allow independent verification and address the concern that post-hoc choices may have influenced the IG/DL rankings. revision: yes
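The paired Wilcoxon test mentioned in the second response could look roughly like the sketch below, which computes an exact two-sided signed-rank p-value for a small number of paired scores. The paired values are invented, the function name is hypothetical, and the code assumes no zero differences and no tied absolute differences. A statistics library would normally be used instead.

```python
import itertools
import numpy as np

def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.
    Simplifying assumption: no zero differences and no tied |differences|."""
    d = np.asarray(a, float) - np.asarray(b, float)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks 1..n of |d|
    w_plus = int(ranks[d > 0].sum())                # sum of ranks of positive differences
    n = len(d)
    total = n * (n + 1) // 2
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in itertools.product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Invented paired scores of two hypothetical methods on the same 8 metrics.
method_a = [0.81, 0.77, 0.90, 0.68, 0.74, 0.85, 0.79, 0.88]
method_b = [0.70, 0.75, 0.80, 0.71, 0.60, 0.65, 0.72, 0.62]

w_plus, p_value = wilcoxon_signed_rank_exact(method_a, method_b)
print(w_plus, p_value)
```

The enumeration covers 2^n sign patterns, so this form only suits small n. For larger samples a normal approximation or a library routine is the usual choice.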

Circularity Check

0 steps flagged

Empirical benchmarking with no derivation chain or self-referential reductions

The paper performs a direct empirical comparison of 13 standard XAI methods on the public multi-center HECKTOR dataset, computing 24 pre-existing metrics for faithfulness, robustness, complexity and plausibility. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided abstract or described methodology. Results are obtained by applying off-the-shelf XAI techniques and reporting metric values; the ranking therefore does not reduce to any input by construction and remains externally falsifiable on the same dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical comparison study with no mathematical derivations, new physical entities, or fitted parameters introduced to support the central claim.

axioms (1)

domain assumption The 24 metrics chosen adequately represent the clinical value of XAI explanations for cancer prognosis models.
Invoked when using the metrics to produce final rankings without additional validation against clinician judgment or patient outcomes.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

Despite similar treatments, substantial variability in outcomes remains among patients

INTRODUCTION Head and neck cancer (HNC) is the seventh most common cancer worldwide [1], treated primarily with radiotherapy with or without chemotherapy and surgery. Despite similar treatments, substantial variability in outcomes remains among patients. This motivates the development of predictive models to guide personalized treatment. Recent studies...

[2]

Dataset The latest HECKTOR 2025 training dataset (https://hecktor25.grand-challenge.org/dataset/) was used to develop HNC outcome prediction models

MATERIALS AND METHODS 2.1. Dataset The latest HECKTOR 2025 training dataset (https://hecktor25.grand-challenge.org/dataset/) was used to develop HNC outcome prediction models. Data from 651 patients, each with CT, PET and Gross Tumor Volume (GTV) mask (Fig. 1.1) of primary tumor and lymph nodes were included. The data was randomly split in a train set of ...

[3]

RESULTS The DenseNet121 achieved a C-index of 0.66 in the multi-center test set, which is comparable with results in previous studies [4], [7]. Tab. 1 summarizes the mean, median, and standard deviation (std) of the rankings of all XAI methods across the four evaluation aspects in the test set. In general, the ranking variances across methods are rea...

[4]

The large standard deviations of rankings in Tab

DISCUSSION This study presented a comprehensive evaluation of 13 post-hoc XAI methods using 20 metrics for HNC outcome prediction task. The large standard deviations of rankings in Tab. 1 reveal substantial variations among XAI methods across metrics, which aligns with the observations from LATEC benchmark [8]. This highlights the importance of select...

[5]

Integrated Gradients and DeepLIFT produced the most faithful and plausible explanations

CONCLUSION In summary, this study provides a comprehensive evaluation of XAI methods for HNC outcome prediction across four aspects: faithfulness, robustness, complexity, and clinical plausibility. Integrated Gradients and DeepLIFT produced the most faithful and plausible explanations. The results underscore the need for task-specific evaluation and ad...

[6]

Ethical approval was not required as confirmed by the license attached with the open access data

COMPLIANCE WITH ETHICAL STANDARDS This research study was conducted retrospectively using human subject data made available in open access by HECKTOR 2025. Ethical approval was not required as confirmed by the license attached with the open access data

[7]

The authors have no relevant financial or non-financial interests to disclose

ACKNOWLEDGMENTS No funding was received for conducting this study. The authors have no relevant financial or non-financial interests to disclose. We acknowledge the idea discussion provided by Dr. Kennth Gilhuijs

[8]

Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,

H. Sung et al., “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,” CA Cancer J Clin, vol. 71, no. 3, pp. 209–249, May 2021, doi: 10.3322/CAAC.21660

[9]

Overview of the HECKTOR Challenge at MICCAI 2022: Automatic Head and Neck Tumor Segmentation and Outcome Prediction in PET/CT,

Andrearczyk V, “Overview of the HECKTOR Challenge at MICCAI 2022: Automatic Head and Neck Tumor Segmentation and Outcome Prediction in PET/CT,” Head and Neck Tumor Segmentation and Outcome Prediction, 2023

[10]

Overview of the HECKTOR challenge at MICCAI 2021: automatic head and neck tumor segmentation and outcome prediction in PET/CT images,

V. Andrearczyk et al., “Overview of the HECKTOR challenge at MICCAI 2021: automatic head and neck tumor segmentation and outcome prediction in PET/CT images,” in 3D Head and Neck Tumor Segmentation in PET/CT Challenge, Springer, 2021, pp. 1–37

[11]

TransRP: Transformer-based PET/CT feature extraction incorporating clinical data for recurrence-free survival prediction in oropharyngeal cancer,

B. Ma, J. Guo, L. Van Dijk, P. M. A. van Ooijen, S. Both, and N. M. Sijtsema, “TransRP: Transformer-based PET/CT feature extraction incorporating clinical data for recurrence-free survival prediction in oropharyngeal cancer,” in Medical Imaging with Deep Learning, 2023

[12]

PET/CT based transformer model for multi-outcome prediction in oropharyngeal cancer,

B. Ma et al., “PET/CT based transformer model for multi-outcome prediction in oropharyngeal cancer,” Radiotherapy and Oncology, p. 110368, 2024

[13]

Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer,

M. Meng, L. Bi, M. Fulham, D. Feng, and J. Kim, “Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer,” arXiv preprint arXiv:2307.03427, 2023

[14]

PET and CT based DenseNet outperforms advanced deep learning models for outcome prediction of oropharyngeal cancer,

B. Ma et al., “PET and CT based DenseNet outperforms advanced deep learning models for outcome prediction of oropharyngeal cancer,” Radiotherapy and Oncology, vol. 207, p. 110852, 2025

[15]

Navigating the maze of explainable ai: A systematic approach to evaluating methods and metrics,

L. Klein, C. Lüth, U. Schlegel, T. Bungert, M. El-Assady, and P. Jäger, “Navigating the maze of explainable ai: A systematic approach to evaluating methods and metrics,” Adv Neural Inf Process Syst, vol. 37, pp. 67106–67146, 2024

[16]

DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network,

J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger, “DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network,” BMC Med Res Methodol, vol. 18, no. 1, 2018, doi: 10.1186/s12874-018-0482-1

[17]

Guidelines and evaluation of clinical explainable AI in medical image analysis,

W. Jin, X. Li, M. Fatehi, and G. Hamarneh, “Guidelines and evaluation of clinical explainable AI in medical image analysis,” Med Image Anal, vol. 84, p. 102684, 2023

