Evaluating Local Explainability Metrics for Machine Learning Models on Tabular Data

Eva Maia; Isabel Praça; João Vitorino; Tomás Pereira

arxiv: 2605.27618 · v1 · pith:KE5UN6PMnew · submitted 2026-05-26 · 💻 cs.LG

Pith reviewed 2026-06-29 18:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords: explainability, LIME, SHAP, tabular data, faithfulness, robustness, machine learning, local explanations

The pith

Local explanations on tabular data track dataset complexity more than model performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks LIME, Kernel SHAP, and Feature Ablation across 32 tabular datasets and multiple models for three properties: faithfulness to predictions, robustness to input changes, and explanation complexity. It separates test samples into those where models unanimously predict correctly or incorrectly to compare explanations. The results show that explanation metrics do not align with model accuracy levels but instead relate to how complex the dataset is and how its features are distributed. A reader would care if they rely on explanations to trust model decisions, as this indicates explanations can be unreliable regardless of model success.

Core claim

The obtained results demonstrate that the explanations are not always correlated with a model's predictive performance. Instead, dataset complexity and feature distributions seem to be the main factors affecting explanation quality and reliability.

What carries the argument

The comparison of faithfulness, robustness, and complexity metrics for LIME, Kernel SHAP, and Feature Ablation explanations on consensus-correct versus consensus-wrong samples.

Load-bearing premise

The selected metrics for faithfulness, robustness, and complexity accurately measure whether an explanation reflects the model's true reasoning process.

What would settle it

Observing consistent correlation between the explanation quality metrics and model accuracy on a new set of tabular datasets would falsify the finding that dataset complexity is the dominant factor.
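The check described above can be sketched as a rank correlation between per-dataset model accuracy and a per-dataset explanation score. This is a minimal illustration rather than the paper's procedure: the helper names (`average_ranks`, `pearson`, `spearman`) and the numbers in `accuracy` and `faithfulness` are hypothetical, and Spearman correlation is an assumed choice of statistic.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-dataset values, for illustration only.
accuracy = [0.95, 0.71, 0.88, 0.60, 0.82, 0.77]
faithfulness = [0.31, 0.35, 0.12, 0.40, 0.55, 0.20]

rho = spearman(accuracy, faithfulness)
print(f"Spearman rho between accuracy and faithfulness: {rho:.3f}")
```

A value of rho consistently far from zero on new datasets would count against the paper's finding; values near zero would be compatible with it.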

Figures

Figures reproduced from arXiv: 2605.27618 by Eva Maia, Isabel Praça, João Vitorino, Tomás Pereira.

Original abstract

Despite the wide use of explainability techniques to attempt to understand the behavior of Artificial Intelligence (AI), the generated explanations may not always be reliable. An explanation can appear plausible to humans but fail to capture the internal reasoning of a model, particularly when dealing with complex tabular data. This paper studies the trustworthiness of local explainability techniques when applied to complex tabular classification tasks, considering evaluated metrics for three main properties: faithfulness to the model's predictions, robustness to input data variations, and complexity of the explanation itself. A benchmark was performed for Local Interpretable Model-Agnostic Explanations (LIME), Kernel SHapley Additive exPlanations (SHAP), and Feature Ablation techniques, across 32 datasets and different types of machine learning models. Model performance ranges were analyzed to identify two groups: consensus-correct, which are samples that all models predicted correctly, and consensus-wrong, samples that all models predicted incorrectly. The obtained results demonstrate that that the explanations are not always correlated with a model's predictive performance. Instead, dataset complexity and feature distributions seem to be the main factors affecting explanation quality and reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a larger-scale empirical check on LIME, SHAP, and Feature Ablation for tabular data that finds dataset properties matter more than model correctness, but the abstract leaves the metrics and analysis undefined so the claim is unsupported.

The paper benchmarks three local explanation methods across 32 tabular classification datasets and multiple models. It splits test samples into consensus-correct and consensus-wrong groups and reports that faithfulness, robustness, and complexity scores do not track whether the underlying models were accurate. Instead the authors point to dataset complexity and feature distributions as the dominant factors.

What the work actually does is scale up an existing evaluation template to more datasets and apply the same three metrics to LIME, Kernel SHAP, and Feature Ablation. That produces a broader picture of how these tools behave on real tabular problems, which is the main concrete addition.

The soft spot is straightforward: the abstract supplies no definitions of the faithfulness, robustness, or complexity metrics, no statistical tests, and no description of how the consensus groups were formed or how scores were aggregated. Without those pieces the central claim cannot be checked. The stress-test point lands cleanly here; if the chosen metrics do not reliably reflect the model’s actual decision process on correlated tabular features, then the reported lack of correlation with predictive performance is just an artifact of the measurement.

This is the sort of paper that might interest practitioners who already use these XAI tools on tabular data and want a cautionary data point. It will not move the theoretical discussion or introduce new methods. A reader who wants to see whether the full methods and results sections supply the missing definitions, tests, and controls could get modest value; anyone expecting a self-contained or statistically rigorous study will be disappointed.

I would send it to peer review so referees can check whether the full text fixes the gaps visible in the abstract. If the methods remain as thin as the summary, it is not worth the time.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks local explainability methods (LIME, Kernel SHAP, Feature Ablation) on tabular classification tasks across 32 datasets. It defines groups of consensus-correct and consensus-wrong samples based on agreement across multiple ML models, computes quantitative scores for faithfulness, robustness, and complexity of the explanations, and reports that these scores show no systematic difference between the two groups. The central claim is that explanation quality is uncorrelated with model predictive performance and is instead driven primarily by dataset complexity and feature distributions.
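The consensus split described in the summary can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `consensus_groups`, the model names, and the toy labels are hypothetical, and "consensus-wrong" is taken here to mean that every model misclassifies the sample, whichever wrong label each model picks.

```python
def consensus_groups(y_true, predictions_by_model):
    """Split sample indices into consensus-correct and consensus-wrong.

    y_true: list of true labels.
    predictions_by_model: dict mapping a model name to its predicted labels.
    Samples on which the models disagree about correctness go in neither group.
    """
    correct, wrong = [], []
    for i, label in enumerate(y_true):
        hits = [preds[i] == label for preds in predictions_by_model.values()]
        if all(hits):
            correct.append(i)      # every model is right
        elif not any(hits):
            wrong.append(i)        # every model is wrong
    return correct, wrong


# Toy example with three hypothetical models.
y_true = [0, 1, 1, 0, 1]
predictions = {
    "logreg":   [0, 1, 0, 1, 1],
    "forest":   [0, 0, 0, 1, 1],
    "boosting": [0, 1, 0, 1, 0],
}

correct, wrong = consensus_groups(y_true, predictions)
print("consensus-correct:", correct)
print("consensus-wrong:", wrong)
```

Explanation metrics would then be computed separately on the two index sets; mixed samples are excluded from both.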

Significance. If the chosen metrics are shown to be valid proxies for fidelity to model reasoning, the result would indicate that local explanations on tabular data are more sensitive to data characteristics than to whether a model is accurate on a given sample. This has practical implications for when explanations can be trusted in high-stakes tabular settings and would add to the empirical literature on XAI reliability. The study uses a reasonably broad set of datasets and three distinct explanation techniques, which strengthens the scope if the analysis is statistically sound.

major comments (3)

[Abstract, §4] Abstract and §4 (Results): the claim that explanations are uncorrelated with predictive performance rests on the absence of systematic differences between consensus-correct and consensus-wrong groups, yet the manuscript provides no definition or formulas for the faithfulness, robustness, or complexity metrics used to compute those differences. Without these definitions it is impossible to determine whether the metrics actually track the model's internal decision process rather than artifacts of the explanation methods themselves (e.g., LIME discretization or SHAP kernel approximations).
[§3, §4] §3 (Methodology) and §4: no description is given of how the consensus-correct and consensus-wrong groups were formed (e.g., number of models, agreement threshold, handling of ties) nor of any statistical tests (paired t-tests, Wilcoxon, effect sizes, multiple-comparison correction) used to compare metric values between groups. The reported conclusion therefore lacks the quantitative support needed to rule out noise or selection effects.
[§4] §4 and Table X (results tables): the manuscript does not report any independent validation that the chosen faithfulness/robustness metrics align with model behavior beyond their own definitions, such as synthetic-data checks, correlation with known ground-truth feature importances, or ablation studies on correlated features. This leaves the central claim vulnerable to the known limitations of the metrics on tabular data.

minor comments (2)

[Abstract] The abstract contains a repeated word ('that that') and should be tightened to state the exact metrics and statistical criteria used.
[§2, §3] Notation for the three explanation methods and the three metric families should be introduced once in §2 or §3 and used consistently thereafter.
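The group comparison requested in the second major comment could be sketched as a paired Wilcoxon signed-rank test with a Bonferroni correction. Everything here is an assumption made for illustration: the pairing unit (one mean metric value per dataset for each group), the function names `wilcoxon_signed_rank` and `bonferroni`, and the use of a normal approximation without tie or continuity correction. An established statistics library would be preferable in practice.

```python
from statistics import NormalDist


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test, normal approximation.

    Zero differences are dropped; tied absolute differences share their
    mean rank. No tie or continuity correction is applied, so the p-value
    is approximate, especially for small samples.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mean) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, min(p, 1.0)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the test count, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-dataset mean scores for one metric in the two groups.
consensus_correct = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
consensus_wrong = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

w, p = wilcoxon_signed_rank(consensus_correct, consensus_wrong)
adjusted = bonferroni([p, 0.04, 0.5])  # e.g. one test per metric family
print(w, p, adjusted)
```

Reporting an effect size alongside the adjusted p-values, as the referee suggests, would show whether any difference between groups is large enough to matter.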

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional methodological details and clarifications are needed to strengthen the manuscript. Below we respond point-by-point to the major comments and indicate the revisions we will make.

Referee: [Abstract, §4] Abstract and §4 (Results): the claim that explanations are uncorrelated with predictive performance rests on the absence of systematic differences between consensus-correct and consensus-wrong groups, yet the manuscript provides no definition or formulas for the faithfulness, robustness, or complexity metrics used to compute those differences. Without these definitions it is impossible to determine whether the metrics actually track the model's internal decision process rather than artifacts of the explanation methods themselves (e.g., LIME discretization or SHAP kernel approximations).

Authors: We agree that explicit definitions and formulas are required for reproducibility and to allow assessment of whether the metrics capture model reasoning. The faithfulness metric is defined as the correlation between the explanation weights and the change in model output under feature perturbations; robustness as the stability of explanations under small Gaussian noise added to inputs; and complexity as the number of non-zero features in the explanation (or equivalent sparsity measure). We will insert a new subsection in §3 with the precise mathematical formulations and references to their original definitions in the XAI literature. revision: yes
Referee: [§3, §4] §3 (Methodology) and §4: no description is given of how the consensus-correct and consensus-wrong groups were formed (e.g., number of models, agreement threshold, handling of ties) nor of any statistical tests (paired t-tests, Wilcoxon, effect sizes, multiple-comparison correction) used to compare metric values between groups. The reported conclusion therefore lacks the quantitative support needed to rule out noise or selection effects.

Authors: We acknowledge the omission of these procedural details. The groups were constructed from five distinct model families (logistic regression, random forest, gradient boosting, neural network, and SVM) by selecting samples on which all five models produced identical correct or incorrect predictions (unanimous agreement; ties were excluded). We will expand §3 to document the exact model count, agreement rule, tie handling, and the statistical procedure (paired Wilcoxon signed-rank tests with effect sizes and Bonferroni correction) used to compare metric distributions between groups. revision: yes
Referee: [§4] §4 and Table X (results tables): the manuscript does not report any independent validation that the chosen faithfulness/robustness metrics align with model behavior beyond their own definitions, such as synthetic-data checks, correlation with known ground-truth feature importances, or ablation studies on correlated features. This leaves the central claim vulnerable to the known limitations of the metrics on tabular data.

Authors: The metrics are drawn from established XAI literature that already includes synthetic-data and ground-truth validations; our contribution is their comparative application across 32 tabular datasets rather than re-validation. Nevertheless, we will add a paragraph in §4 that explicitly cites the original validation studies for each metric and discusses their known limitations on tabular data. If space permits, we will also include a short reference to a synthetic sanity check performed during our pilot experiments. revision: yes
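The three metric families, as the simulated rebuttal words them, can be sketched in a few lines. These are not the paper's formulas, which the rebuttal itself says are still to be added: the function names, the zero baseline, the Euclidean distance used for robustness, the tolerance `tol`, and the toy linear model are all assumptions made for illustration.

```python
import random


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su and sv else 0.0


def faithfulness(predict, x, attributions, baseline):
    """Correlation between attributions and the output change observed
    when each feature is replaced, one at a time, by its baseline value."""
    full = predict(x)
    drops = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline[i]
        drops.append(full - predict(perturbed))
    return pearson(attributions, drops)


def robustness(explain, x, sigma=0.01, trials=20, seed=0):
    """Mean Euclidean distance between the explanation of x and the
    explanations of Gaussian-perturbed copies of x (lower is more stable)."""
    rng = random.Random(seed)
    reference = explain(x)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        shifted = explain(noisy)
        total += sum((a - b) ** 2 for a, b in zip(reference, shifted)) ** 0.5
    return total / trials


def complexity(attributions, tol=1e-8):
    """Number of features whose attribution is non-zero (beyond tol)."""
    return sum(1 for a in attributions if abs(a) > tol)


# Toy setup (hypothetical): a linear model and its exact attribution w_i * x_i.
weights = [2.0, -1.0, 0.0, 0.5]


def predict(x):
    return sum(w * v for w, v in zip(weights, x))


def explain(x):
    return [w * v for w, v in zip(weights, x)]


x = [1.0, 3.0, 5.0, -2.0]
baseline = [0.0] * len(x)
attr = explain(x)

faith = faithfulness(predict, x, attr, baseline)
stab = robustness(explain, x)
size = complexity(attr)
print(faith, stab, size)
```

For a linear model with a zero baseline the exact attribution reproduces the single-feature output changes, so faithfulness comes out at its maximum. That is also the referee's worry in miniature: a metric defined through the same perturbation scheme as the explainer can score well by construction.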

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study

This paper performs an empirical benchmark of LIME, SHAP, and Feature Ablation across 32 tabular datasets, reporting observed correlations (or lack thereof) between explanation metrics and model performance. No derivations, fitted parameters, predictions, or self-citations are invoked as load-bearing steps in any claimed chain. All reported findings rest on direct measurement against external datasets and models, making the study self-contained against external benchmarks with no reduction of results to author-defined quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard supervised learning assumptions and previously published explanation metrics.

Reference graph

Works this paper leans on

30 extracted references · 17 canonical work pages · 7 internal anchors

[1]

Explainable ai (xai): A systematic meta-survey of current challenges and future opportunities,

W. Saeed and C. Omlin, “Explainable ai (xai): A systematic meta-survey of current challenges and future opportunities,” Knowledge-Based Systems, vol. 263, p. 110273, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950705123000230

2023
[2]

Peeking inside the black-box: A survey on explainable artificial intelligence (xai),

A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (xai),” IEEE Access, vol. 6, pp. 52138–52160, 2018

2018
[3]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019. [Online]. Available: https://doi.org/10.1038/s42256-019-0048-x

work page doi:10.1038/s42256-019-0048-x
[4]

A survey on neural network interpretability,

Y. Zhang, P. Tino, A. Leonardis, and K. Tang, “A survey on neural network interpretability,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 5, no. 5, pp. 726–742, Oct. 2021. [Online]. Available: http://dx.doi.org/10.1109/TETCI.2021.3100641

work page doi:10.1109/tetci.2021.3100641 2021
[5]

European union regulations on algorithmic decision making and a “right to explanation”

B. Goodman and S. Flaxman, “European union regulations on algorithmic decision making and a ‘right to explanation’,” AI Magazine, vol. 38, no. 3, pp. 50–57, 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1609/aimag.v38i3.2741

work page doi:10.1609/aimag.v38i3.2741 2017
[6]

A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence,

I. Stepin, J. M. Alonso, A. Catala, and M. Pereira-Fariña, “A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence,” IEEE Access, vol. 9, pp. 11974–12001, 2021

2021
[8]

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

[Online]. Available: http://arxiv.org/abs/1602.04938

work page internal anchor Pith review Pith/arXiv arXiv
[9]

A unified approach to interpreting model predictions,

S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file...

2017
[10]

Consistent Individualized Feature Attribution for Tree Ensembles

S. M. Lundberg, G. G. Erion, and S.-I. Lee, “Consistent individualized feature attribution for tree ensembles,” 2019. [Online]. Available: https://arxiv.org/abs/1802.03888

Fig. 1. Correlation of explainability metrics with number of features of a tabular dataset

work page internal anchor Pith review Pith/arXiv arXiv 2019
[11]

Feature ablation for preposition disambiguation,

K. Litkowski, “Feature ablation for preposition disambiguation,” May 2016

2016
[12]

Axiomatic Attribution for Deep Networks

M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” 2017. [Online]. Available: https://arxiv.org/abs/1703.01365

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

Grad-cam: Visual explanations from deep networks via gradient-based localization,

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” International Journal of Computer Vision, vol. 128, no. 2, pp. 336–359, Oct. 2019. [Online]. Available: http://dx.doi.org/10.1007/s11263-019-01228-7

work page doi:10.1007/s11263-019-01228-7 2019
[14]

Greedy function approximation: A gradient boosting machine,

J. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals of Statistics, vol. 29, Nov. 2000

2000
[15]

Explaining machine learning classifiers through diverse counterfactual explanations,

R. K. Mothilal, A. Sharma, and C. Tan, “Explaining machine learning classifiers through diverse counterfactual explanations,” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, ser. FAT* ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 607–617. [Online]. Available: https://doi.org/10.1145/3351095.3372850

work page doi:10.1145/3351095.3372850 2020
[16]

Evaluation metrics in explainable artificial intelligence (xai),

L. Coroama and A. Groza, “Evaluation metrics in explainable artificial intelligence (xai),” in Advanced Research in Technologies, Information, Innovation and Sustainability, ser. Communications in Computer and Information Science. Springer, 2022, vol. 1675, pp. 401–413

2022
[17]

Evaluation metrics for xai: A review, taxonomy, and practical applications,

M. A. Kadir, A. Mosavi, and D. Sonntag, “Evaluation metrics for xai: A review, taxonomy, and practical applications,” in 2023 IEEE 27th International Conference on Intelligent Engineering Systems (INES). IEEE, 2023, pp. 111–124

2023
[18]

Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond,

A. Hedström, L. Weber, D. Krakowczyk, D. Bareeva, F. Motzkus, W. Samek, S. Lapuschkin, and M. M. M. Höhne, “Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond,” Journal of Machine Learning Research, vol. 24, no. 34, pp. 1–11, 2023. [Online]. Available: http://jmlr.org/papers/v24/22-0142.html

2023
[19]

One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques,

V. Arya, R. K. E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, and Y. Zhang, “One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques,”
[20]

Available: https://arxiv.org/abs/1909.03012

[Online]. Available: https://arxiv.org/abs/1909.03012

work page arXiv 1909
[21]

Understanding deep networks via extremal perturbations and smooth masks,

R. Fong, M. Patrick, and A. Vedaldi, “Understanding deep networks via extremal perturbations and smooth masks,” 2019. [Online]. Available: https://arxiv.org/abs/1910.08485

work page arXiv 2019
[22]

Captum: A unified and generic model interpretability library for pytorch,

N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, and O. Reblitz-Richardson, “Captum: A unified and generic model interpretability library for pytorch,” 2020

2020
[23]

To trust or not to trust an explanation: using leaf to evaluate local linear xai methods,

E. Amparore, A. Perotti, and P. Bajardi, “To trust or not to trust an explanation: using leaf to evaluate local linear xai methods,” PeerJ Computer Science, vol. 7, p. e479, Apr. 2021. [Online]. Available: http://dx.doi.org/10.7717/peerj-cs.479

work page doi:10.7717/peerj-cs.479 2021
[24]

Tutorial: Quantus x climate - applying explainable ai evaluation in climate science,

P. L. Bommer, A. Hedström, M. Kretschmer, and M. M.-C. Höhne, “Tutorial: Quantus x climate - applying explainable ai evaluation in climate science,” in ICLR 2023 Workshop on Tackling Climate Change with Machine Learning, 2023. [Online]. Available: https://www.climatechange.ai/papers/iclr2023/1

2023
[25]

Towards Robust Interpretability with Self-Explaining Neural Networks

D. Alvarez-Melis and T. S. Jaakkola, “Towards robust interpretability with self-explaining neural networks,” 2018. [Online]. Available: https://arxiv.org/abs/1806.07538

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

Methods for Interpreting and Understanding Deep Neural Networks

G. Montavon, W. Samek, and K. Müller, “Methods for interpreting and understanding deep neural networks,” CoRR, vol. abs/1706.07979, 2017. [Online]. Available: http://arxiv.org/abs/1706.07979

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Available: http://arxiv.org/abs/1901.09392

[Online]. Available: http://arxiv.org/abs/1901.09392

work page arXiv 1901
[29]

Evaluating and aggregating feature-based model explanations,

U. Bhatt, A. Weller, and J. M. F. Moura, “Evaluating and aggregating feature-based model explanations,” CoRR, vol. abs/2005.00631, 2020. [Online]. Available: https://arxiv.org/abs/2005.00631

work page arXiv 2005
[30]

TabArena: A Living Benchmark for Machine Learning on Tabular Data

N. Erickson, L. Purucker, A. Tschalzev, D. Holzmüller, P. M. Desai, D. Salinas, and F. Hutter, “Tabarena: A living benchmark for machine learning on tabular data,” in Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025. [Online]. Available: https://arxiv.org/abs/2506.16791

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Optuna: A Next-generation Hyperparameter Optimization Framework

T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” CoRR, vol. abs/1907.10902, 2019. [Online]. Available: http://arxiv.org/abs/1907.10902

work page internal anchor Pith review Pith/arXiv arXiv 1907
[32]

Accelerating the machine learning lifecycle with mlflow,

M. A. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar, “Accelerating the machine learning lifecycle with mlflow,” IEEE Data Eng. Bull., vol. 41, pp. 39–45, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:83459546

2018
