Case-Based Reasoning for Assisting Domain Experts in Processing Fraud Alerts of Black-Box Machine Learning Models

Hilde J.P. Weerts; Mykola Pechenizkiy; Werner van Ipenburg

arXiv: 1907.03334 · v1 · pith:3ZHSVQWTnew · submitted 2019-07-07 · 💻 cs.LG · cs.HC · stat.ML

Pith reviewed 2026-05-25 01:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.HC · stat.ML

keywords: case-based reasoning, fraud detection, post-hoc explanations, black-box models, trustworthiness, visualization, machine learning

The pith

Similarity of local post-hoc explanations enables case-based visualizations that help fraud analysts assess black-box model alerts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a case-based reasoning system that retrieves and visualizes prior instances similar to a new alert according to the similarity of their local post-hoc explanations. The goal is to supply domain experts with concrete evidence bearing on whether a black-box prediction is trustworthy enough to act on. Empirical evaluation indicates that the resulting visualizations support alert processing, and a user study at a major Dutch bank shows the approach is rated useful and easy to use. A sympathetic reader would care because the method offers a practical route to handling opaque high-stakes predictions without demanding full model transparency.

Core claim

A case-based reasoning approach that measures similarity on local post-hoc explanations of predictions can generate visualizations that supply useful evidence on trustworthiness for fraud analysts processing machine-learning alerts.

What carries the argument

Case-based reasoning retrieval that ranks and displays prior cases by similarity of their local post-hoc explanations to supply trustworthiness evidence.
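As a rough illustration of the retrieval step described above, the sketch below ranks prior cases by the distance between their explanation vectors and the explanation of a new alert. Figure 2 suggests that SHAP values serve as the explanations, but this summary does not state the similarity measure, so the Euclidean distance, the function names `euclidean` and `retrieve_similar_cases`, and the toy attribution values are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length attribution vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_similar_cases(query_explanation, case_base, k=3):
    """Return the k prior cases whose explanations lie closest to the query.

    case_base is a list of (case_id, explanation_vector) pairs.
    Euclidean distance is an assumed choice, not taken from the paper.
    """
    ranked = sorted(
        case_base,
        key=lambda case: euclidean(query_explanation, case[1]),
    )
    return ranked[:k]


# Hypothetical attribution vectors over three made-up features.
case_base = [
    ("case_a", [0.40, -0.10, 0.05]),
    ("case_b", [0.38, -0.12, 0.07]),
    ("case_c", [-0.30, 0.25, 0.00]),
    ("case_d", [0.10, 0.10, 0.10]),
]
new_alert = [0.41, -0.09, 0.06]

neighbors = retrieve_similar_cases(new_alert, case_base, k=2)
neighbor_ids = [case_id for case_id, _ in neighbors]
print(neighbor_ids)
```

Ranking in explanation space rather than raw feature space means two transactions count as similar when the model scored them for similar reasons, even if their raw feature values differ.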

If this is right

The visualization can be useful for processing alerts.
The approach is perceived useful and easy to use by fraud analysts at a major Dutch bank.
Similarity between local post-hoc explanations provides evidence that domain experts can act on.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval logic could transfer to other regulated domains that rely on black-box scoring.
If the explanation similarity metric correlates with expert judgment, the system may shorten review time per alert.
The method offers an alternative to building inherently interpretable models when post-hoc tools already exist.

Load-bearing premise

That similarity of local post-hoc explanations between predictions indicates cases that are meaningfully informative about trustworthiness for domain experts.
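One way to make this premise concrete is to look at how the model fared on the retrieved neighbors. The sketch below is a hypothetical numeric stand-in for what the paper presents as a visualization: it counts how many retrieved cases turned out to be true or false positives. The outcome labels, the function name `summarize_neighbor_outcomes`, and the example data are assumptions, not taken from the paper.

```python
from collections import Counter


def summarize_neighbor_outcomes(neighbor_outcomes):
    """Summarize how the model's alerts fared on the retrieved prior cases.

    neighbor_outcomes is a list of labels, each either "true_positive"
    or "false_positive" (hypothetical labels for confirmed and
    unconfirmed alerts). Returns the counts and the true-positive share,
    or None for the share when no neighbors were retrieved.
    """
    counts = Counter(neighbor_outcomes)
    total = len(neighbor_outcomes)
    share_true_positive = counts["true_positive"] / total if total else None
    return counts, share_true_positive


# Hypothetical outcomes for four retrieved neighbors.
outcomes = ["true_positive", "false_positive", "true_positive", "true_positive"]
counts, share = summarize_neighbor_outcomes(outcomes)
print(dict(counts), share)
```

If the premise holds, a high share of confirmed alerts among the neighbors would be evidence in favor of acting on the new alert; if explanation similarity does not track outcome similarity, this share carries no information.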

What would settle it

A controlled comparison in which fraud analysts process the same set of alerts with and without the visualization and show no measurable difference in decision accuracy, speed, or reported .
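A comparison of this kind could be analysed in several ways. As one hedged possibility, the sketch below runs an exact paired sign-flip permutation test on per-analyst scores with and without the visualization. The test choice, the function name `paired_permutation_pvalue`, and the per-analyst counts are illustrative assumptions; none of the numbers come from the paper.

```python
from itertools import product


def paired_permutation_pvalue(with_tool, without_tool):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis of no effect, each analyst's difference is
    equally likely to have either sign, so every sign pattern is enumerated.
    Feasible only for small samples (2**n patterns).
    """
    diffs = [a - b for a, b in zip(with_tool, without_tool)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Hypothetical counts of correctly processed alerts (out of 20) per analyst.
correct_with = [16, 15, 18, 17]
correct_without = [14, 14, 16, 16]

p_value = paired_permutation_pvalue(correct_with, correct_without)
print(p_value)
```

With only four hypothetical analysts the smallest attainable two-sided p-value is 2/16, which shows why sample size matters for a study meant to settle the question.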

Figures

Figures reproduced from arXiv: 1907.03334 by Hilde J.P. Weerts, Mykola Pechenizkiy, Werner van Ipenburg.

**Figure 2.** t-SNE visualization that groups transactions with similar SHAP explanations. The SHAP explanations explain pre…

**Figure 3.** In the simulated user experiment, the dataset is…

**Figure 4.** Improvement or decrease in average MAP of the estimated user confidence score compared to the MAP of the model’s…

**Figure 5.** The performance of different neighborhood visual…

**Figure 6.** CBR dashboard when applied to predictions of a random forest model trained on the…

Original abstract

In many contexts, it can be useful for domain experts to understand to what extent predictions made by a machine learning model can be trusted. In particular, estimates of trustworthiness can be useful for fraud analysts who process machine learning-generated alerts of fraudulent transactions. In this work, we present a case-based reasoning (CBR) approach that provides evidence on the trustworthiness of a prediction in the form of a visualization of similar previous instances. Different from previous works, we consider similarity of local post-hoc explanations of predictions and show empirically that our visualization can be useful for processing alerts. Furthermore, our approach is perceived useful and easy to use by fraud analysts at a major Dutch bank.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Paper applies CBR to local explanation similarities for fraud alerts and gets bank analyst feedback, but user study details are missing.

The paper's key move is using case-based reasoning on the similarity of local explanations rather than the transactions themselves to support fraud analysts reviewing black-box alerts. They built a visualization of similar past cases and tested it with analysts at a major Dutch bank, who reported it as useful and easy to use. This is new in the sense that it applies CBR to explanation similarity, which the authors position as different from earlier work. The practical angle is also a plus: real deployment context and direct feedback from the target users. The soft spot is the evaluation. We get no information on how the user study was run – number of participants, study protocol, controls, or any measures beyond subjective perception. Without that, it's hard to say how much the results support the claim that the visualization helps with processing alerts. The link from similar explanations to trustworthiness is a design choice, not something they test independently. The paper has no heavy math or data issues visible; it's a system description with a perception study. This is for people working on applied interpretability in finance or high-stakes ML. A practitioner might get ideas for their own tools, but a methods-focused reader would want more on the study design. It should go to peer review. The idea is reasonable and the real-world feedback gives it weight, but the authors need to strengthen the evidence section.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces a case-based reasoning (CBR) approach that visualizes similar previous instances based on the similarity of their local post-hoc explanations, to help fraud analysts assess the trustworthiness of black-box ML model alerts for fraudulent transactions. It reports an empirical demonstration that the visualization aids alert processing and that the approach is perceived as useful and easy to use by fraud analysts at a major Dutch bank.

Significance. If the user study is methodologically sound, the work could provide practical evidence for combining CBR with post-hoc explanations in a high-stakes domain, addressing a real need for domain experts to efficiently process ML-generated alerts.

major comments (1)

[Evaluation] Evaluation section: the abstract and manuscript report an empirical demonstration and positive user feedback but supply no information on study design, sample size, statistical tests, controls, or objective measures of processing improvement; this absence prevents evaluation of support for the central claim that the visualization is useful for processing alerts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and constructive comment on the evaluation. We address the point below and will revise the manuscript accordingly.

Referee: [Evaluation] Evaluation section: the abstract and manuscript report an empirical demonstration and positive user feedback but supply no information on study design, sample size, statistical tests, controls, or objective measures of processing improvement; this absence prevents evaluation of support for the central claim that the visualization is useful for processing alerts.

Authors: We agree that the manuscript as submitted provides insufficient detail on the user study methodology. The study was a qualitative evaluation with fraud analysts at the Dutch bank, using questionnaires on perceived usefulness and ease of use (based on the Technology Acceptance Model, TAM) along with think-aloud sessions, but the current text does not report the participant count, the exact protocol, or any quantitative metrics. In the revision we will expand the Evaluation section with a full description of the study design, sample size, procedure, and any available objective or statistical results to allow proper assessment of the claims.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical CBR visualization method based on similarity of local post-hoc explanations, with utility demonstrated via a user study and perception evaluation by fraud analysts. No equations, fitted parameters presented as predictions, self-citation load-bearing steps, or derivation chains exist that reduce any claim to its inputs by construction. The similarity-to-trustworthiness link is introduced as a design premise whose practical value is then tested externally through domain-expert feedback rather than asserted via internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or invented entities; ledger is empty.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

[1] José Miguel Benedí Ruiz, Francisco Casacuberta Nolla, Enrique Vidal Ruiz, Inmaculada Benlloch, Antonio Castellanos López, María José Castro Bleda, Jon Ander Gómez Adrián, Alfons Juan Císcar, and Juan Antonio Puchol García. 1991. Proyecto ROARS: Robust Analytical Speech Recognition System. (1991)

[2] Fred D. Davis. 1989. Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Quarterly 13, 3 (1989), 319–340

[3] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Comput. Surv. 46, 4 (2014), 44:1–44:37. https://doi.org/10.1145/2523813

[4] Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. 2018. To Trust Or Not To Trust A Classifier. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 5541–5552

[5] Ron Kohavi. 1997. Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. KDD (09 1997)

[6] Janet L. Kolodner. 1992. An introduction to case-based reasoning. Artificial Intelligence Review 6, 1 (1992), 3–34. https://doi.org/10.1007/bf00155578

[7] J. B. Kruskal. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1 (01 Mar 1964), 1–27. https://doi.org/10.1007/BF02289565

[8] Volodymyr Kuleshov and Percy S Liang. 2015. Calibrated Structured Prediction. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 3474–3482. http://papers.nips.cc/paper/5658-calibrated-structured-prediction.pdf

[9] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individualized Feature Attribution for Tree Ensembles. (2018). arXiv:1802.03888

[10] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (Eds.). Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predi...

[11] G.P. McArdle and D.C. Wilson. 2003. Visualising Case-Base Usage. In Workshop Proceedings ICCBR, L. McGinty (Ed.). Trondheim, 105–114

[12] Conor Nugent and Pádraig Cunningham. 2005. A Case-Based Explanation System for Black-Box Systems. Artificial Intelligence Review 24, 2 (oct 2005), 163–178. https://doi.org/10.1007/s10462-005-4609-5

[13] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, 1 (11 Dec 2017), 36. https://doi.org/10.1186/s13040-017-0154-4

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

[15] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, New York, USA, 1135–1144. https://doi.org/10.1145/2939672.2939778

[16] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-Precision Model-Agnostic Explanations. In AAAI

[17] Frode Sørmo, Jörg Cassens, and Agnar Aamodt. 2005. Explanation in Case-Based Reasoning–Perspectives and Goals. Artificial Intelligence Review 24, 2 (oct 2005), 109–143. https://doi.org/10.1007/s10462-005-4607-7

[18] Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 3 (2014), 647–665. https://doi.org/10.1007/s10115-013-0679-x

KDD-ADF ’19, August 2019, Anchorage, Alaska, USA. Weerts, et al.

[19] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198

[20] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2018. Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. Harvard journal of law & technology 31 (04 2018), 841–887

[21] Hilde J.P. Weerts, Werner van Ipenburg, and Mykola Pechenizkiy. 2019. A Human-Grounded Evaluation of SHAP for Alert Processing. In Proceedings of KDD Workshop on Explainable AI (KDD-XAI ’19)

[22] Dietrich Wettschereck, David W. Aha, and Takao Mohri. 1997. A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review 11, 1/5 (1997), 273–314. https://doi.org/10.1023/a:1006593614256

[23] Indre Zliobaite, Mykola Pechenizkiy, and Joao Gama. 2016. An overview of concept drift applications. In Big Data Analysis: New Algorithms for a New Society. Springer, 91–114
