Explainable AI needs formalization

Ahcène Boubekki; Benedict Clark; Danny Panknin; Jörg Martin; Rick Wilming; Rustam Zhumagambetov; Stefan Haufe

arXiv: 2409.14590 · v6 · submitted 2024-09-22 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-23 20:20 UTC · model grok-4.3

keywords: explainable AI · XAI · feature attribution · formalization · correctness criteria · machine learning interpretability · model diagnosis

The pith

Popular XAI methods systematically attribute importance to input features that are statistically independent of the prediction target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current explainable AI methods fail to answer relevant questions about models, data, or inputs because they credit features that are statistically independent of the prediction target. This failure occurs because the methods do not address well-defined problems and are not tested against objective criteria for whether an explanation is correct. Consequently, they cannot support reliable diagnosis of models, scientific discovery from data, or identification of actionable intervention targets. The authors conclude that researchers must first specify the exact problem each explanation method is meant to solve and then build methods and evaluation criteria around those definitions. Doing so would produce multiple, use-case-specific standards of explanation correctness along with measurable performance benchmarks.

Core claim

Current XAI methods do not address well-defined problems and are not evaluated against targeted criteria of explanation correctness. As a result they systematically attribute importance to input features that are independent of the prediction target. This limits their utility for diagnosing and correcting data and models, for scientific discovery, and for identifying intervention targets. Researchers should formally define the problems they intend to solve and design methods accordingly, which will lead to diverse use-case-dependent notions of explanation correctness and objective metrics of explanation performance.

What carries the argument

The systematic attribution of importance to input features that are statistically independent of the prediction target, caused by the absence of well-defined problems and targeted correctness criteria.
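The failure mode named here can be illustrated with a toy suppressor-variable setup. The sketch below is an assumption-laden illustration, not the paper's own experiment: the data-generating process, the function name `suppressor_demo`, and the use of a least-squares linear model are all chosen for this example. It shows a feature that is independent of the target yet receives a non-zero model weight, which any attribution based on the weight or the input gradient would report as importance.

```python
import numpy as np


def suppressor_demo(n_samples=20000, seed=0):
    """Fit a least-squares model on toy data with a suppressor variable.

    All names and the data-generating process are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)  # signal that defines the target
    d = rng.standard_normal(n_samples)  # distractor, drawn independently of z

    y = z        # prediction target
    x1 = z + d   # informative feature, contaminated by the distractor
    x2 = d       # suppressor: statistically independent of y

    X = np.column_stack([x1, x2])
    # Least-squares weights; the exact solution here is y = x1 - x2.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Empirical correlation between the suppressor and the target.
    corr_x2_y = np.corrcoef(x2, y)[0, 1]
    return w, corr_x2_y


weights, corr_x2_y = suppressor_demo()
print("model weights (x1, x2):", weights)
print("corr(x2, y):", corr_x2_y)
```

In this construction the model needs `x2` to cancel the distractor in `x1`, so its weight is as large in magnitude as that of `x1` even though `x2` alone carries no information about `y`. Whether that weight should count as "importance" depends on the question being asked, which is the formalization gap the paper points to.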

If this is right

XAI cannot reliably diagnose and correct data and models.
XAI cannot support scientific discovery from machine learning models.
XAI cannot identify intervention targets.
Diverse use-case-dependent notions of explanation correctness will be needed.
Objective metrics of explanation performance can be developed and used to validate algorithms.
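One way an objective metric of explanation performance might look, assuming synthetic data where the set of truly informative features is known, is a precision score over the top-ranked attributions. The function name `precision_at_k` and the choice of precision as the metric are assumptions made for this sketch; the paper does not prescribe a specific metric here.

```python
import numpy as np


def precision_at_k(attributions, ground_truth_mask, k=None):
    """Fraction of the k highest-|attribution| features that are truly informative.

    `ground_truth_mask` is a hypothetical boolean vector marking the features
    known to be informative in a controlled (synthetic) benchmark.
    """
    scores = np.abs(np.asarray(attributions, dtype=float))
    mask = np.asarray(ground_truth_mask, dtype=bool)
    if scores.shape != mask.shape:
        raise ValueError("attributions and ground_truth_mask must have the same shape")
    if k is None:
        k = int(mask.sum())  # default: as many features as are truly informative
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k largest absolute attributions (stable for ties).
    top_k = np.argsort(-scores, kind="stable")[:k]
    return float(mask[top_k].mean())


# Only feature 0 is informative in this made-up example.
truth = [True, False, False]
good = precision_at_k([0.9, 0.1, 0.0], truth)   # ranks feature 0 first
bad = precision_at_k([0.2, -0.9, 0.05], truth)  # ranks feature 1 first
print(good, bad)
```

A score like this only becomes meaningful once the ground truth itself is defined by a formal problem statement, so it is a consequence of formalization rather than a substitute for it.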

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Formal problem definitions could allow XAI methods to be compared directly against causal or statistical ground truth in controlled settings.
Different application domains may converge on distinct formal correctness criteria rather than a single universal standard.
Once formal criteria exist, existing attribution methods could be shown to be unsuitable for some questions and retained only for others.
Objective performance metrics would enable systematic improvement of XAI algorithms similar to how loss functions drive model training.

Load-bearing premise

The observed attribution to independent features stems primarily from the absence of well-defined problems and targeted correctness criteria rather than from other implementation or data issues.

What would settle it

A demonstration that at least one popular XAI method avoids attributing importance to independent features once a specific problem is formally stated and a targeted correctness criterion is applied.
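One possible way to write such a targeted correctness criterion is sketched below. The notation is hypothetical and not taken from the paper: f denotes the trained model, (X, Y) the features and prediction target, and a_j(f, x) the importance a method assigns to feature j on input x.

```latex
% Hypothetical criterion: features independent of the target receive zero attribution.
\[
  X_j \perp\!\!\!\perp Y
  \;\Longrightarrow\;
  a_j(f, x) = 0 \quad \text{for all inputs } x .
\]
% Under this criterion, a method fails if there exist a feature j and an input x
% with X_j \perp\!\!\!\perp Y and a_j(f, x) \neq 0.
```

This is only one candidate criterion; the paper's position is that different use cases may call for different ones.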

Figures

Figures reproduced from arXiv: 2409.14590 by Ahcène Boubekki, Benedict Clark, Danny Panknin, Jörg Martin, Rick Wilming, Rustam Zhumagambetov, Stefan Haufe.

Original abstract

The field of "explainable artificial intelligence" (XAI) seemingly addresses the desire that decisions of machine learning systems should be human-understandable. However, in its current state, XAI itself needs scrutiny. Popular methods cannot reliably answer relevant questions about ML models, their training data, or test inputs, because they systematically attribute importance to input features that are independent of the prediction target. This limits the utility of XAI for diagnosing and correcting data and models, for scientific discovery, and for identifying intervention targets. The fundamental reason for this is that current XAI methods do not address well-defined problems and are not evaluated against targeted criteria of explanation correctness. Researchers should formally define the problems they intend to solve and design methods accordingly. This will lead to diverse use-case-dependent notions of explanation correctness and objective metrics of explanation performance that can be used to validate XAI algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper correctly flags that many XAI methods can credit independent features, but rests the diagnosis on an untested generalization without examples or data.

The central observation is that current attribution methods often assign importance to input features that have no relation to the target variable. The authors link this directly to the absence of well-defined problems and specific correctness criteria, and they recommend that the field move toward use-case-specific formalizations and objective performance metrics. That diagnosis is reasonable on its face and points to a real practical issue in model debugging and scientific use of ML.

Referee Report

3 major / 0 minor

Summary. The manuscript claims that current XAI methods cannot reliably answer questions about ML models, training data, or test inputs because they systematically attribute importance to input features independent of the prediction target. The root cause is identified as the absence of well-defined problems and targeted correctness criteria; the authors advocate that researchers should formally define intended problems, yielding use-case-dependent notions of explanation correctness and objective performance metrics.

Significance. If the central position holds, the paper could encourage the XAI community to shift from ad-hoc methods toward problem-specific formalizations, potentially increasing utility for model diagnosis, scientific discovery, and intervention targeting. As a conceptual position piece without new derivations, empirical results, or machine-checked proofs, its value is primarily in framing a methodological critique rather than delivering a technical advance.

major comments (3)

[Abstract] Abstract: the assertion that popular methods 'systematically attribute importance to input features that are independent of the prediction target' is advanced as the central motivation yet is unsupported by any concrete example, counter-factual, reference to a specific method (e.g., SHAP, LIME), or empirical demonstration within the manuscript.
[Full text] Position argument (throughout): the claim that the 'fundamental reason' for the observed failures is the lack of well-defined problems is not accompanied by discussion or exclusion of alternative explanations such as implementation artifacts or data properties, leaving the causal diagnosis untested.
[Recommendations] Recommendations section: the prescription to 'formally define the problems they intend to solve' is not illustrated by even one worked example of a well-defined XAI problem together with a corresponding correctness criterion, reducing the actionability of the proposed remedy.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight opportunities to make the central claims more concrete and the recommendations more actionable. We respond to each major comment below and indicate where revisions will be made.

Referee: [Abstract] Abstract: the assertion that popular methods 'systematically attribute importance to input features that are independent of the prediction target' is advanced as the central motivation yet is unsupported by any concrete example, counter-factual, reference to a specific method (e.g., SHAP, LIME), or empirical demonstration within the manuscript.

Authors: We agree that the abstract would benefit from a brief concrete illustration to ground the central claim. In revision we will add one short example (e.g., a LIME or SHAP attribution on a simple classifier where a feature receives non-zero importance despite having no causal effect on the predicted class) while preserving the position-paper character of the work.
revision: yes

Referee: [Full text] Position argument (throughout): the claim that the 'fundamental reason' for the observed failures is the lack of well-defined problems is not accompanied by discussion or exclusion of alternative explanations such as implementation artifacts or data properties, leaving the causal diagnosis untested.

Authors: The manuscript treats the absence of well-defined problems as fundamental because, without explicit problem statements and correctness criteria, no method can be guaranteed to produce reliable answers regardless of implementation quality or data characteristics. Alternative factors such as bugs or data artifacts are acknowledged as possible contributors but are viewed as downstream of the missing formalization. To address the referee's concern we will insert a short paragraph explicitly discussing why these alternatives are secondary rather than fundamental.
revision: partial

Referee: [Recommendations] Recommendations section: the prescription to 'formally define the problems they intend to solve' is not illustrated by even one worked example of a well-defined XAI problem together with a corresponding correctness criterion, reducing the actionability of the proposed remedy.

Authors: We accept this criticism. The Recommendations section will be expanded with one concise worked example (e.g., formalizing the problem of "identifying features whose removal changes a model's decision on a given input" together with a correctness criterion based on agreement with an oracle intervention). This addition will directly illustrate the advocated approach.
revision: yes
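The worked example proposed in this response could be sketched as follows. Everything concrete here is an assumption made for illustration: a linear classifier stands in for the model, "removal" is taken to mean replacing a feature with a baseline value, the oracle is the brute-force intervention itself, and agreement is measured with Jaccard similarity. The function names are hypothetical.

```python
import numpy as np


def decision(w, b, x):
    """Binary decision of a linear classifier (illustrative stand-in for a model)."""
    return int(np.dot(w, x) + b > 0)


def oracle_flip_set(w, b, x, baseline):
    """Features whose removal (replacement by the baseline value) flips the decision."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    original = decision(w, b, x)
    flips = set()
    for j in range(x.size):
        x_removed = x.copy()
        x_removed[j] = baseline[j]  # intervene on feature j only
        if decision(w, b, x_removed) != original:
            flips.add(j)
    return flips


def agreement(claimed, oracle):
    """Jaccard similarity between an explanation's feature set and the oracle set."""
    claimed, oracle = set(claimed), set(oracle)
    if not claimed and not oracle:
        return 1.0
    return len(claimed & oracle) / len(claimed | oracle)


# Made-up model and input: only removing feature 0 changes the decision.
w = np.array([2.0, -1.0, 0.1])
b = -0.5
x = np.array([1.0, 1.0, 1.0])
oracle = oracle_flip_set(w, b, x, baseline=np.zeros(3))

print(oracle)                     # features that flip the decision when removed
print(agreement({0, 1}, oracle))  # an explanation that also names feature 1
```

With the problem stated this way, an explanation is correct to the degree that it agrees with the intervention oracle, so competing methods can be ranked by a single number for this particular question. A different question, such as which features are informative about the target, would need a different oracle.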

Circularity Check

0 steps flagged

No significant circularity; position paper without derivations or self-referential claims

The paper is a conceptual position argument advocating for formalization of XAI problems and correctness criteria. It advances no equations, derivations, fitted parameters, predictions, or formal statements that could reduce to inputs by construction. The observation that methods attribute importance to independent features is presented as motivation rather than a derived result. No self-citations function as load-bearing uniqueness theorems, and no ansatzes or renamings are smuggled in. The central claim remains independent of any internal reduction, making this a self-contained non-finding consistent with the default expectation for position pieces.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that current XAI methods exhibit systematic attribution to independent features and that this behavior is caused by missing formal problem statements.

axioms (1)

domain assumption: Popular XAI methods systematically attribute importance to input features that are statistically independent of the prediction target.
This observation is presented as the reason for XAI's limited utility and is not derived within the text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

"Researchers should formally define the problems they intend to solve first and then design methods accordingly. This will lead to notions of explanation correctness that can be theoretically verified and objective metrics of explanation performance that can be assessed using ground-truth data."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · Recovery theorem (LogicNat ≃ Nat) · echoes

"The current XAI terminology uses the term 'explanation' indiscriminately... reflective of a deeper absence of well-defined problems for XAI to solve."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
echoes: The paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
