A Human-Grounded Evaluation of SHAP for Alert Processing
Pith reviewed 2026-05-25 01:20 UTC · model grok-4.3
The pith
SHAP explanations produce no significant improvement in alert processing performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a human-grounded evaluation of SHAP for alert processing found no statistically significant difference in task performance when explanations were available compared to when they were not, even though the explanations influenced the decision-making process and the model's confidence score continued to serve as the leading source of evidence.
What carries the argument
A controlled comparison of alert processing tasks with and without SHAP explanations, using statistical tests on task utility metrics across three participant groups.
If this is right
- The model's confidence score functions as the primary evidence source for human assessors of alerts.
- SHAP explanations alter the decision process without translating into measurable gains in correctness assessment.
- Intuitions about the practical benefits of local model-agnostic explanations require direct testing in application contexts.
- Performance outcomes and process changes must be evaluated separately when assessing explanation utility.
Where Pith is reading between the lines
- Real domain experts with operational context might show different patterns of reliance on SHAP versus confidence scores.
- Combining SHAP with other cues or interactive interfaces could be needed to achieve performance improvements.
- The finding raises the question of whether similar null results appear for other explanation methods in alert verification tasks.
Load-bearing premise
Participants possessing only basic knowledge of explainable machine learning adequately represent the decision processes and performance of actual domain experts who routinely process alerts.
What would settle it
A study measuring alert processing performance with and without SHAP using actual operational domain experts instead of participants with only basic XAI knowledge.
Figures
read the original abstract
In the past years, many new explanation methods have been proposed to achieve interpretability of machine learning predictions. However, the utility of these methods in practical applications has not been researched extensively. In this paper we present the results of a human-grounded evaluation of SHAP, an explanation method that has been well-received in the XAI and related communities. In particular, we study whether this local model-agnostic explanation method can be useful for real human domain experts to assess the correctness of positive predictions, i.e. alerts generated by a classifier. We performed experimentation with three different groups of participants (159 in total), who had basic knowledge of explainable machine learning. We performed a qualitative analysis of recorded reflections of experiment participants performing alert processing with and without SHAP information. The results suggest that the SHAP explanations do impact the decision-making process, although the model's confidence score remains to be a leading source of evidence. We statistically test whether there is a significant difference in task utility metrics between tasks for which an explanation was available and tasks in which it was not provided. As opposed to common intuitions, we did not find a significant difference in alert processing performance when a SHAP explanation is available compared to when it is not.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a human-grounded evaluation of SHAP explanations for assisting users in processing positive predictions (alerts) from an ML classifier. It describes experiments with 159 participants possessing only basic XAI knowledge, divided into three groups. The work includes qualitative analysis of participants' recorded reflections on decision-making processes with versus without SHAP, plus statistical tests comparing task utility metrics across conditions. The central claim is that SHAP explanations influence decision-making but yield no significant difference in alert-processing performance relative to the no-explanation baseline, with model confidence scores remaining the dominant source of evidence.
Significance. If the null result on performance metrics is reliable for the studied population, the work supplies empirical counter-evidence to common assumptions about the operational utility of local post-hoc explanations such as SHAP in alert-verification settings. The qualitative component offers concrete observations on how users integrate explanations with confidence scores. These elements could inform XAI deployment decisions, though the restriction to non-expert participants substantially narrows the scope of any such implications.
major comments (2)
- [Abstract] Abstract: The manuscript positions the study as evaluating utility 'for real human domain experts' who 'routinely process alerts in operational settings,' yet explicitly recruits participants who 'had basic knowledge of explainable machine learning.' No evidence, pilot data, or argument is supplied that the decision processes or performance of this cohort match those of operational domain experts; this mismatch is load-bearing for interpreting the reported null result on task utility metrics.
- [Abstract] Abstract (statistical analysis description): The text states that 'statistical tests were performed on task utility metrics' and that 'we did not find a significant difference,' but supplies no information on the precise metrics, sample-size justification or power analysis, choice of statistical procedure, or any multiple-testing correction. These omissions prevent assessment of whether the non-significant result is informative or under-powered.
minor comments (2)
- [Abstract] The abstract refers to 'three different groups of participants' without clarifying how the groups map onto the with/without-SHAP conditions or whether between-group differences were analyzed.
- [Abstract] The qualitative analysis is described only at a high level ('recorded reflections'); a brief statement of the coding scheme or inter-rater reliability would improve transparency.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below. We agree that the abstract wording requires revision to avoid overstatement and will update the manuscript accordingly. Statistical details are elaborated in the full text, but we will consider a brief clarification in the abstract.
read point-by-point responses
-
Referee: [Abstract] Abstract: The manuscript positions the study as evaluating utility 'for real human domain experts' who 'routinely process alerts in operational settings,' yet explicitly recruits participants who 'had basic knowledge of explainable machine learning.' No evidence, pilot data, or argument is supplied that the decision processes or performance of this cohort match those of operational domain experts; this mismatch is load-bearing for interpreting the reported null result on task utility metrics.
Authors: We agree the abstract phrasing is imprecise. The title and study design frame this as a human-grounded evaluation, which by definition uses non-expert participants with general XAI familiarity rather than operational domain experts. The manuscript does not claim the results generalize to experts or provide evidence of matching decision processes. We will revise the abstract to remove 'real human domain experts,' clarify the participant pool, and state that the null result applies to this cohort. This preserves the core finding without overclaiming scope. revision: yes
-
Referee: [Abstract] Abstract (statistical analysis description): The text states that 'statistical tests were performed on task utility metrics' and that 'we did not find a significant difference,' but supplies no information on the precise metrics, sample-size justification or power analysis, choice of statistical procedure, or any multiple-testing correction. These omissions prevent assessment of whether the non-significant result is informative or under-powered.
Authors: The abstract is intentionally concise. The full manuscript (Sections 3 and 4) specifies the metrics (accuracy and response time on alert verification), sample size (159 participants across three conditions), statistical procedures (independent t-tests and ANOVA on the utility metrics), and notes that no multiple-testing correction was applied because comparisons were pre-specified. No a priori power analysis was performed; sample size was determined by recruitment feasibility for the online study. We can add one sentence to the abstract summarizing the tests and sample if space permits, but prefer to retain brevity and direct readers to the methods. revision: partial
Circularity Check
Empirical human study contains no derivation chain or self-referential reductions
full rationale
The paper reports a controlled user study with 159 participants, qualitative reflections, and statistical tests comparing alert-processing performance with vs. without SHAP explanations. No equations, model fits, or theoretical derivations appear; the null result is a direct empirical comparison between experimental conditions. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is self-contained against its own collected data.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard assumptions of statistical hypothesis testing hold (independent observations, appropriate distributional form for the test statistic, and sufficient power to detect meaningful differences).
Forward citations
Cited by 1 Pith paper
-
Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships
Proposes the theory of epistemic quasi-partnerships (EQP) to guide the RCC approach (reasons, counterfactuals, confidence) for human-grounded explanations in AI decision support systems.
Reference graph
Works this paper leans on
-
[1]
Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Inter- pretable Machine Learning. arXiv:1702.08608 http://arxiv.org/abs/1702.08608
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[2]
Dheeru Dua and Karra Taniskidou Efi. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml
work page 2017
-
[3]
Wouter Duivesteijn, Tara Farzami, Thijs Putman, Evertjan Peer, Hilde J. P. Weerts, Jasper N. Adegeest, Gerson Foks, and Mykola Pechenizkiy. 2017. Have It Both Ways - From A/B Testing to A&B Testing with Exceptional Model Mining. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017. 114–126
work page 2017
-
[4]
Allahyari Hiva and Lavesson Niklas. 2011. User-oriented Assessment of Classifi- cation Model Understandability.Frontiers in Artificial Intelligence and Applications 227 (2011), 11–19. https://doi.org/10.3233/978-1-60750-754-3-11
-
[5]
Sadiq Hussain, Neama Abdulaziz Dahan, Fadl Mutaher Ba-Alwi, and Najoua Ribata. 2018. Educational Data Mining and Analysis of Students’ Academic Performance Using WEKA. J. Electrical Engineering and Computer Science 9, 2 (Feb. 2018), 447. https://doi.org/10.11591/ijeecs.v9.i2.pp447-459
-
[6]
Johan Huysmans, Karel Dejaeger, Christophe Mues, Jan Vanthienen, and Bart Baesens. 2011. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems 51, 1 (apr 2011), 141–154. https://doi.org/10.1016/J.DSS.2010.12.003
-
[7]
Volodymyr Kuleshov and Percy S Liang. 2015. Calibrated Structured Prediction. In Advances in Neural Information Processing Systems 28 , C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3474–3482
work page 2015
- [8]
-
[9]
Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. 2016. Interpretable Decision Sets: A Joint Framework for Description and Prediction. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 1675–1684. https://doi.org/ 10.1145/2939672.2939874
-
[10]
Zachary C. Lipton. 2016. The Mythos of Model Interpretability. arXiv:arXiv:1606.03490
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[11]
Ying Lu and Judy A. Bean. 1995. On the sample size for one-sided equivalence of sensitivities based upon McNemar's test. Statistics in Medicine 14, 16 (Aug. 1995), 1831–1839. https://doi.org/10.1002/sim.4780141611
-
[12]
Consistent Individualized Feature Attribution for Tree Ensembles
Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individual- ized Feature Attribution for Tree Ensembles. arXiv:1802.03888
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 , I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774
work page 2017
-
[14]
Fred G. Paas. 1992. Training strategies for attaining transfer of problem-solving skill in statistics: A cognitive-load approach. Journal of Educational Psychology 84, 4 (1992), 429–434
work page 1992
-
[15]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour- napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830
work page 2011
-
[16]
Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wort- man Vaughan, and Hanna Wallach. 2018. Manipulating and Measuring Model Interpretability. arXiv:1802.07810
-
[17]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. InProceedings of the 22nd ACM SIG International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, New York, USA, 1135–1144. https://doi.org/10. 1145/2939672.2939778
-
[18]
Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665
work page 2014
-
[19]
van Rijn, Bernd Bischl, and Luis Torgo
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning.SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.