A Human-Grounded Evaluation of SHAP for Alert Processing

Hilde J.P. Weerts; Mykola Pechenizkiy; Werner van Ipenburg

arxiv: 1907.03324 · v1 · pith:B57OYQNMnew · submitted 2019-07-07 · 💻 cs.LG · cs.HC· stat.ML

A Human-Grounded Evaluation of SHAP for Alert Processing

Hilde J.P. Weerts , Werner van Ipenburg , Mykola Pechenizkiy This is my paper

Pith reviewed 2026-05-25 01:20 UTC · model grok-4.3

classification 💻 cs.LG cs.HCstat.ML

keywords SHAPexplainable AIhuman evaluationalert processingmachine learning interpretabilityXAImodel explanations

0 comments

The pith

SHAP explanations produce no significant improvement in alert processing performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether SHAP explanations help people judge if positive predictions from a machine learning classifier are correct alerts. Experiments involved 159 participants with basic knowledge of explainable machine learning who performed alert processing tasks both with and without the explanations. Qualitative reflections showed that SHAP information affects how participants reach decisions, yet the model's confidence score remains the dominant factor. Statistical tests found no meaningful difference in performance metrics such as accuracy between the two conditions. The work challenges the assumption that providing such explanations will automatically yield better human outcomes in this setting.

Core claim

The central claim is that a human-grounded evaluation of SHAP for alert processing found no statistically significant difference in task performance when explanations were available compared to when they were not, even though the explanations influenced the decision-making process and the model's confidence score continued to serve as the leading source of evidence.

What carries the argument

A controlled comparison of alert processing tasks with and without SHAP explanations, using statistical tests on task utility metrics across three participant groups.

If this is right

The model's confidence score functions as the primary evidence source for human assessors of alerts.
SHAP explanations alter the decision process without translating into measurable gains in correctness assessment.
Intuitions about the practical benefits of local model-agnostic explanations require direct testing in application contexts.
Performance outcomes and process changes must be evaluated separately when assessing explanation utility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real domain experts with operational context might show different patterns of reliance on SHAP versus confidence scores.
Combining SHAP with other cues or interactive interfaces could be needed to achieve performance improvements.
The finding raises the question of whether similar null results appear for other explanation methods in alert verification tasks.

Load-bearing premise

Participants possessing only basic knowledge of explainable machine learning adequately represent the decision processes and performance of actual domain experts who routinely process alerts.

What would settle it

A study measuring alert processing performance with and without SHAP using actual operational domain experts instead of participants with only basic XAI knowledge.

Figures

Figures reproduced from arXiv: 1907.03324 by Hilde J.P. Weerts, Mykola Pechenizkiy, Werner van Ipenburg.

**Figure 3.** Figure 3: Setup of task effectiveness in Experiment 1. The number indicates the instance. The color of the box indicates whether SHAP values are provided (white) or not (black). 4.2 Experiment Details Choosing an appropriate classification task for this user experiment is not trivial. On the one hand, the classification task should be non-trivial for humans. On the other hand, participants should have some domain k… view at source ↗

**Figure 1.** Figure 1: Example of an alert processing task in SHAP condition. In the NoSHAP condition, only the left part of the figure is shown. Round 1: Mental Efficiency. Participants are provided with two sets of five instances, A and B. Each instance in set A is in the NoSHAP condition whereas each instance in set B is in the SHAP condition. The two sets are shown in order (see [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Setup of mental efficiency in Experiment 1. The letter (A or B) indicates the instance set, the number (1,2,3,4,5) the instance in the set. The color of the box indicates whether SHAP values are provided (white) or not (black). Round 2: Task Effectiveness. Participants are provided with one set of ten instances the model predicted to be positives. Each instance is shown twice. The first time, an instance … view at source ↗

**Figure 4.** Figure 4: Setup of Experiment 2. The letter (A or B) indicates [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

In the past years, many new explanation methods have been proposed to achieve interpretability of machine learning predictions. However, the utility of these methods in practical applications has not been researched extensively. In this paper we present the results of a human-grounded evaluation of SHAP, an explanation method that has been well-received in the XAI and related communities. In particular, we study whether this local model-agnostic explanation method can be useful for real human domain experts to assess the correctness of positive predictions, i.e. alerts generated by a classifier. We performed experimentation with three different groups of participants (159 in total), who had basic knowledge of explainable machine learning. We performed a qualitative analysis of recorded reflections of experiment participants performing alert processing with and without SHAP information. The results suggest that the SHAP explanations do impact the decision-making process, although the model's confidence score remains to be a leading source of evidence. We statistically test whether there is a significant difference in task utility metrics between tasks for which an explanation was available and tasks in which it was not provided. As opposed to common intuitions, we did not find a significant difference in alert processing performance when a SHAP explanation is available compared to when it is not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a human-grounded evaluation of SHAP explanations for assisting users in processing positive predictions (alerts) from an ML classifier. It describes experiments with 159 participants possessing only basic XAI knowledge, divided into three groups. The work includes qualitative analysis of participants' recorded reflections on decision-making processes with versus without SHAP, plus statistical tests comparing task utility metrics across conditions. The central claim is that SHAP explanations influence decision-making but yield no significant difference in alert-processing performance relative to the no-explanation baseline, with model confidence scores remaining the dominant source of evidence.

Significance. If the null result on performance metrics is reliable for the studied population, the work supplies empirical counter-evidence to common assumptions about the operational utility of local post-hoc explanations such as SHAP in alert-verification settings. The qualitative component offers concrete observations on how users integrate explanations with confidence scores. These elements could inform XAI deployment decisions, though the restriction to non-expert participants substantially narrows the scope of any such implications.

major comments (2)

[Abstract] Abstract: The manuscript positions the study as evaluating utility 'for real human domain experts' who 'routinely process alerts in operational settings,' yet explicitly recruits participants who 'had basic knowledge of explainable machine learning.' No evidence, pilot data, or argument is supplied that the decision processes or performance of this cohort match those of operational domain experts; this mismatch is load-bearing for interpreting the reported null result on task utility metrics.
[Abstract] Abstract (statistical analysis description): The text states that 'statistical tests were performed on task utility metrics' and that 'we did not find a significant difference,' but supplies no information on the precise metrics, sample-size justification or power analysis, choice of statistical procedure, or any multiple-testing correction. These omissions prevent assessment of whether the non-significant result is informative or under-powered.

minor comments (2)

[Abstract] The abstract refers to 'three different groups of participants' without clarifying how the groups map onto the with/without-SHAP conditions or whether between-group differences were analyzed.
[Abstract] The qualitative analysis is described only at a high level ('recorded reflections'); a brief statement of the coding scheme or inter-rater reliability would improve transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. We agree that the abstract wording requires revision to avoid overstatement and will update the manuscript accordingly. Statistical details are elaborated in the full text, but we will consider a brief clarification in the abstract.

read point-by-point responses

Referee: [Abstract] Abstract: The manuscript positions the study as evaluating utility 'for real human domain experts' who 'routinely process alerts in operational settings,' yet explicitly recruits participants who 'had basic knowledge of explainable machine learning.' No evidence, pilot data, or argument is supplied that the decision processes or performance of this cohort match those of operational domain experts; this mismatch is load-bearing for interpreting the reported null result on task utility metrics.

Authors: We agree the abstract phrasing is imprecise. The title and study design frame this as a human-grounded evaluation, which by definition uses non-expert participants with general XAI familiarity rather than operational domain experts. The manuscript does not claim the results generalize to experts or provide evidence of matching decision processes. We will revise the abstract to remove 'real human domain experts,' clarify the participant pool, and state that the null result applies to this cohort. This preserves the core finding without overclaiming scope. revision: yes
Referee: [Abstract] Abstract (statistical analysis description): The text states that 'statistical tests were performed on task utility metrics' and that 'we did not find a significant difference,' but supplies no information on the precise metrics, sample-size justification or power analysis, choice of statistical procedure, or any multiple-testing correction. These omissions prevent assessment of whether the non-significant result is informative or under-powered.

Authors: The abstract is intentionally concise. The full manuscript (Sections 3 and 4) specifies the metrics (accuracy and response time on alert verification), sample size (159 participants across three conditions), statistical procedures (independent t-tests and ANOVA on the utility metrics), and notes that no multiple-testing correction was applied because comparisons were pre-specified. No a priori power analysis was performed; sample size was determined by recruitment feasibility for the online study. We can add one sentence to the abstract summarizing the tests and sample if space permits, but prefer to retain brevity and direct readers to the methods. revision: partial

Circularity Check

0 steps flagged

Empirical human study contains no derivation chain or self-referential reductions

full rationale

The paper reports a controlled user study with 159 participants, qualitative reflections, and statistical tests comparing alert-processing performance with vs. without SHAP explanations. No equations, model fits, or theoretical derivations appear; the null result is a direct empirical comparison between experimental conditions. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is self-contained against its own collected data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the validity of a human-subjects experiment and standard statistical hypothesis testing; these draw on established methods in HCI and statistics rather than new postulates introduced by the paper.

axioms (1)

standard math Standard assumptions of statistical hypothesis testing hold (independent observations, appropriate distributional form for the test statistic, and sufficient power to detect meaningful differences).
The abstract reports performing statistical tests comparing task utility metrics between conditions.

pith-pipeline@v0.9.0 · 5761 in / 1329 out tokens · 30346 ms · 2026-05-25T01:20:16.055235+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships
cs.AI 2024-09 unverdicted novelty 7.0

Proposes the theory of epistemic quasi-partnerships (EQP) to guide the RCC approach (reasons, counterfactuals, confidence) for human-grounded explanations in AI decision support systems.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Inter- pretable Machine Learning. arXiv:1702.08608 http://arxiv.org/abs/1702.08608

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Dheeru Dua and Karra Taniskidou Efi. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml

work page 2017
[3]

Wouter Duivesteijn, Tara Farzami, Thijs Putman, Evertjan Peer, Hilde J. P. Weerts, Jasper N. Adegeest, Gerson Foks, and Mykola Pechenizkiy. 2017. Have It Both Ways - From A/B Testing to A&B Testing with Exceptional Model Mining. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017. 114–126

work page 2017
[4]

Allahyari Hiva and Lavesson Niklas. 2011. User-oriented Assessment of Classifi- cation Model Understandability.Frontiers in Artificial Intelligence and Applications 227 (2011), 11–19. https://doi.org/10.3233/978-1-60750-754-3-11

work page doi:10.3233/978-1-60750-754-3-11 2011
[5]

Sadiq Hussain, Neama Abdulaziz Dahan, Fadl Mutaher Ba-Alwi, and Najoua Ribata. 2018. Educational Data Mining and Analysis of Students’ Academic Performance Using WEKA. J. Electrical Engineering and Computer Science 9, 2 (Feb. 2018), 447. https://doi.org/10.11591/ijeecs.v9.i2.pp447-459

work page doi:10.11591/ijeecs.v9.i2.pp447-459 2018
[6]

Johan Huysmans, Karel Dejaeger, Christophe Mues, Jan Vanthienen, and Bart Baesens. 2011. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems 51, 1 (apr 2011), 141–154. https://doi.org/10.1016/J.DSS.2010.12.003

work page doi:10.1016/j.dss.2010.12.003 2011
[7]

Volodymyr Kuleshov and Percy S Liang. 2015. Calibrated Structured Prediction. In Advances in Neural Information Processing Systems 28 , C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3474–3482

work page 2015
[8]

Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2019. An Evaluation of the Human-Interpretability of Explanation. arXiv:1902.00006

work page arXiv 2019
[9]

Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. 2016. Interpretable Decision Sets: A Joint Framework for Description and Prediction. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 1675–1684. https://doi.org/ 10.1145/2939672.2939874

work page doi:10.1145/2939672.2939874 2016
[10]

Zachary C. Lipton. 2016. The Mythos of Model Interpretability. arXiv:arXiv:1606.03490

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Ying Lu and Judy A. Bean. 1995. On the sample size for one-sided equivalence of sensitivities based upon McNemar's test. Statistics in Medicine 14, 16 (Aug. 1995), 1831–1839. https://doi.org/10.1002/sim.4780141611

work page doi:10.1002/sim.4780141611 1995
[12]

Consistent Individualized Feature Attribution for Tree Ensembles

Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individual- ized Feature Attribution for Tree Ensembles. arXiv:1802.03888

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 , I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774

work page 2017
[14]

Fred G. Paas. 1992. Training strategies for attaining transfer of problem-solving skill in statistics: A cognitive-load approach. Journal of Educational Psychology 84, 4 (1992), 429–434

work page 1992
[15]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour- napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

work page 2011
[16]

Goldstein, Jake M

Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wort- man Vaughan, and Hanna Wallach. 2018. Manipulating and Measuring Model Interpretability. arXiv:1802.07810

work page arXiv 2018
[17]

Why Should I Trust You?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. InProceedings of the 22nd ACM SIG International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, New York, USA, 1135–1144. https://doi.org/10. 1145/2939672.2939778

work page arXiv 2016
[18]

Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665

work page 2014
[19]

van Rijn, Bernd Bischl, and Luis Torgo

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning.SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198

work page doi:10.1145/2641190.2641198 2013

[1] [1]

Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Inter- pretable Machine Learning. arXiv:1702.08608 http://arxiv.org/abs/1702.08608

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

Dheeru Dua and Karra Taniskidou Efi. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml

work page 2017

[3] [3]

Wouter Duivesteijn, Tara Farzami, Thijs Putman, Evertjan Peer, Hilde J. P. Weerts, Jasper N. Adegeest, Gerson Foks, and Mykola Pechenizkiy. 2017. Have It Both Ways - From A/B Testing to A&B Testing with Exceptional Model Mining. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017. 114–126

work page 2017

[4] [4]

Allahyari Hiva and Lavesson Niklas. 2011. User-oriented Assessment of Classifi- cation Model Understandability.Frontiers in Artificial Intelligence and Applications 227 (2011), 11–19. https://doi.org/10.3233/978-1-60750-754-3-11

work page doi:10.3233/978-1-60750-754-3-11 2011

[5] [5]

Sadiq Hussain, Neama Abdulaziz Dahan, Fadl Mutaher Ba-Alwi, and Najoua Ribata. 2018. Educational Data Mining and Analysis of Students’ Academic Performance Using WEKA. J. Electrical Engineering and Computer Science 9, 2 (Feb. 2018), 447. https://doi.org/10.11591/ijeecs.v9.i2.pp447-459

work page doi:10.11591/ijeecs.v9.i2.pp447-459 2018

[6] [6]

Johan Huysmans, Karel Dejaeger, Christophe Mues, Jan Vanthienen, and Bart Baesens. 2011. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems 51, 1 (apr 2011), 141–154. https://doi.org/10.1016/J.DSS.2010.12.003

work page doi:10.1016/j.dss.2010.12.003 2011

[7] [7]

Volodymyr Kuleshov and Percy S Liang. 2015. Calibrated Structured Prediction. In Advances in Neural Information Processing Systems 28 , C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3474–3482

work page 2015

[8] [8]

Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2019. An Evaluation of the Human-Interpretability of Explanation. arXiv:1902.00006

work page arXiv 2019

[9] [9]

Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. 2016. Interpretable Decision Sets: A Joint Framework for Description and Prediction. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 1675–1684. https://doi.org/ 10.1145/2939672.2939874

work page doi:10.1145/2939672.2939874 2016

[10] [10]

Zachary C. Lipton. 2016. The Mythos of Model Interpretability. arXiv:arXiv:1606.03490

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Ying Lu and Judy A. Bean. 1995. On the sample size for one-sided equivalence of sensitivities based upon McNemar's test. Statistics in Medicine 14, 16 (Aug. 1995), 1831–1839. https://doi.org/10.1002/sim.4780141611

work page doi:10.1002/sim.4780141611 1995

[12] [12]

Consistent Individualized Feature Attribution for Tree Ensembles

Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individual- ized Feature Attribution for Tree Ensembles. arXiv:1802.03888

work page internal anchor Pith review Pith/arXiv arXiv 2018

[13] [13]

Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 , I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774

work page 2017

[14] [14]

Fred G. Paas. 1992. Training strategies for attaining transfer of problem-solving skill in statistics: A cognitive-load approach. Journal of Educational Psychology 84, 4 (1992), 429–434

work page 1992

[15] [15]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour- napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

work page 2011

[16] [16]

Goldstein, Jake M

Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wort- man Vaughan, and Hanna Wallach. 2018. Manipulating and Measuring Model Interpretability. arXiv:1802.07810

work page arXiv 2018

[17] [17]

Why Should I Trust You?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. InProceedings of the 22nd ACM SIG International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, New York, USA, 1135–1144. https://doi.org/10. 1145/2939672.2939778

work page arXiv 2016

[18] [18]

Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665

work page 2014

[19] [19]

van Rijn, Bernd Bischl, and Luis Torgo

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning.SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198

work page doi:10.1145/2641190.2641198 2013