arxiv: 2604.04456 · v1 · submitted 2026-04-06 · 💻 cs.AI · cs.CL · cs.LG

Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition

Pith reviewed 2026-05-10 20:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords SHAP · rationale stability · explainable AI · cosine similarity · BERT · sentiment analysis · model consistency · XAI evaluation

The pith

A metric based on cosine similarity of SHAP values quantifies whether model explanations stay consistent for inputs sharing the same label.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new way to check if AI models use stable reasoning by comparing their explanation patterns across multiple examples that have the same label. Standard evaluations look at one prediction at a time, but this metric uses SHAP attributions and cosine similarity to see if the model focuses on the same features for similar cases. The authors apply it to BERT models on sentiment datasets and test it against usual fidelity checks to find when models might be relying on inconsistent or biased features despite correct predictions. If successful, this gives a practical tool for verifying that explanations reflect the model's intended behavior across variations.

Core claim

The central claim is that rationale stability can be characterized empirically by the average cosine similarity between pairs of SHAP attribution vectors computed on test samples that share the same class label. This provides a way to detect when a model fails to maintain consistent attribution patterns under label-preserving perturbations, thereby revealing deviations from its training objectives.
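
One way to write this claim down, as a sketch under assumed notation (the symbols below are illustrative and not taken from the paper): let $S_c$ be the set of test samples with label $c$, and let $\phi(x) \in \mathbb{R}^d$ be the SHAP attribution vector of $x$ once all vectors have been brought to a common length $d$. The per-class stability would then be the mean pairwise cosine similarity:

```latex
\mathrm{Stab}(c) \;=\; \frac{2}{|S_c|\,(|S_c|-1)}
\sum_{\substack{x_i, x_j \in S_c \\ i < j}}
\frac{\phi(x_i)^{\top}\phi(x_j)}{\lVert \phi(x_i)\rVert_2 \,\lVert \phi(x_j)\rVert_2}
```

The value lies in $[-1, 1]$. How $d$ is fixed for variable-length text is a separate choice that this formula leaves open.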

What carries the argument

The stability metric: the cosine similarity of SHAP feature-importance vectors computed for input samples that share the same label.
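
A minimal Python sketch of such a metric follows. It assumes the attribution vectors already have equal length. The function names (`cosine`, `stability_by_label`) and the toy numbers are hypothetical, chosen for illustration rather than taken from the paper's code.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("attribution vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def stability_by_label(attributions, labels):
    """Mean pairwise cosine similarity of attribution vectors within each label."""
    groups = {}
    for vec, label in zip(attributions, labels):
        groups.setdefault(label, []).append(vec)
    scores = {}
    for label, vecs in groups.items():
        pairs = list(combinations(vecs, 2))
        if pairs:  # a label with a single sample has no pairs to compare
            scores[label] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return scores


# Toy attribution vectors (invented for illustration, not real SHAP output).
attributions = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
labels = ["pos", "pos", "neg", "neg"]
scores = stability_by_label(attributions, labels)
```

In this toy input the two "pos" vectors point in the same direction and the two "neg" vectors are orthogonal, so the per-label scores sit at the two ends of the non-negative range.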

If this is right

  • Models showing low metric scores can be flagged for potential bias or misalignment even if their accuracy is high.
  • The metric can be used alongside traditional fidelity metrics to provide a more complete picture of explanation quality.
  • It enables checking consistency under controlled perturbations on datasets like SST-2 and IMDB.
  • It supports building more trustworthy pattern recognition systems by quantifying rationale stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the metric proves reliable, it could be applied during model development to enforce consistency as a training constraint.
  • Low stability might indicate vulnerability to adversarial examples that change explanations without changing labels.
  • Future work could test whether this stability correlates with human judgments of explanation quality.

Load-bearing premise

That the cosine similarity of SHAP vectors for same-label samples accurately reflects whether the model maintains consistent reasoning aligned with its training objectives.

What would settle it

Finding a model that achieves correct predictions on same-label inputs but exhibits low cosine similarity in SHAP attributions, or conversely a model with inconsistent predictions but high similarity, would challenge the metric's validity.

Figures

Figures reproduced from arXiv: 2604.04456 by Abu Noman Md Sakib, Merjulah Roby, Zhensen Wang, Zijie Zhang.

Figure 1: Overall methodology for rationale-stability evaluation. Samples from the data are encoded and processed by transformer classifiers, post-hoc token attributions are computed, and the proposed ESS Score is obtained.
Surrounding text (truncated excerpt): 3.2 Model Setup and Preprocessing. For our experiments, we employ several pre-trained models and well-established datasets. The models we use are BERT (bert-base-uncased) [6], RoBERTa (roberta-bas…
Figure 2: Feature Importance for BERT on SST-2 (Test Split).
Surrounding text (truncated excerpt): …shift after paraphrasing indicates reduced attribution stability even when the predicted label is preserved. 4.3 Analysis. The ESS effectively identifies explanation inconsistencies, which are essential for ensuring model reliability and alignment with human values. Lower ESS values for positive sentiment (e.g., paraphrased test: 0.0945) suggest unreliable …
Figure 3: Mean ESS and Fidelity across models.
Surrounding text (truncated excerpt): …significance value of p = 0.021 validates that paraphrasing has a statistically significant effect on the explanation stability. This finding supports the use of paraphrasing as a tool for detecting misalignment in the model’s reasoning. One key insight is that human interpretable explanations require a certain degree of consistency. In cognitive science, it is widely a…
Original abstract

Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a novel metric for assessing rationale stability in XAI by computing the cosine similarity of SHAP attribution vectors across same-label inputs (and under label-preserving perturbations). It implements and tests the metric using a pre-trained BERT model on SST-2, with additional experiments on RoBERTa, DistilBERT, and the IMDB dataset, and compares the metric against standard fidelity measures to detect inconsistent model reasoning or biased feature reliance.

Significance. If the metric can be shown to reliably isolate reasoning consistency independent of input length, it would address a genuine gap in instance-centric XAI evaluation by providing a cross-sample consistency check. The public release of the code at the cited GitHub repository is a clear strength that supports reproducibility and further testing.

major comments (2)
  1. [Abstract and metric definition] The central claim requires that cosine similarity of SHAP vectors reliably signals consistent reasoning for same-label examples. However, SST-2 sentences vary from a few to >30 tokens, so the resulting SHAP vectors have unequal dimensionality. No description is given (in the abstract or the experimental setup) of the alignment, padding, truncation, or aggregation step that renders the vectors commensurate before cosine similarity is computed. If the implementation simply pads with zeros or truncates, the metric will be dominated by length differences and padding tokens rather than semantic feature correspondence, breaking the claimed link to 'consistent reasoning aligned with intended objectives.'
  2. [Abstract / Experimental evaluation] The abstract states that the metric is evaluated 'through a series of experiments' that 'identify misaligned predictions' and are 'compared against standard fidelity metrics,' yet no quantitative results, tables, error analysis, or ablation on the perturbation generation process are referenced. Without these, it is impossible to verify whether the new metric actually separates from existing fidelity baselines or whether the reported inconsistencies are driven by the length-sensitivity issue above.
minor comments (1)
  1. [Abstract] The title emphasizes 'Controlled Perturbations' but the abstract provides no concrete description of how label-preserving perturbations are generated or applied; this detail should be added for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will make corresponding revisions to improve clarity and address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract and metric definition] The central claim requires that cosine similarity of SHAP vectors reliably signals consistent reasoning for same-label examples. However, SST-2 sentences vary from a few to >30 tokens, so the resulting SHAP vectors have unequal dimensionality. No description is given (in the abstract or the experimental setup) of the alignment, padding, truncation, or aggregation step that renders the vectors commensurate before cosine similarity is computed. If the implementation simply pads with zeros or truncates, the metric will be dominated by length differences and padding tokens rather than semantic feature correspondence, breaking the claimed link to 'consistent reasoning aligned with intended objectives.'

    Authors: We agree that the manuscript does not provide an explicit description of how variable-length SHAP attribution vectors are made commensurate for cosine similarity. This omission leaves open the possibility that length differences or padding artifacts could influence the metric. In the current implementation, shorter sequences are padded with zero-valued attributions to a fixed maximum length (the model's sequence limit), but no masking or normalization was applied to isolate padding effects. We will revise the experimental setup section to include a precise description of the vector preparation procedure and will add an ablation study examining the metric's sensitivity to different padding and truncation strategies. This will allow readers to assess whether the reported consistency reflects semantic reasoning or length bias. revision: yes

  2. Referee: [Abstract / Experimental evaluation] The abstract states that the metric is evaluated 'through a series of experiments' that 'identify misaligned predictions' and are 'compared against standard fidelity metrics,' yet no quantitative results, tables, error analysis, or ablation on the perturbation generation process are referenced. Without these, it is impossible to verify whether the new metric actually separates from existing fidelity baselines or whether the reported inconsistencies are driven by the length-sensitivity issue above.

    Authors: The abstract is written at a high level and does not include specific numerical results or table references, which is standard for the format. The body of the manuscript does contain quantitative comparisons, tables of similarity scores versus fidelity metrics, and analysis of detected inconsistencies. However, we acknowledge that the abstract could better signal the strength of these findings and that an explicit ablation on perturbation generation would help rule out confounds such as length sensitivity. We will revise the abstract to include concise references to key quantitative outcomes and will add a dedicated ablation subsection on the perturbation process in the revised manuscript. revision: yes
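
The padding concern in the first exchange above can be made concrete with a toy sketch. It compares cosine similarity on zero-padded positional vectors with cosine similarity on attributions summed per token. The helpers (`pad`, `by_token`, `cosine_by_token`) and the attribution values are hypothetical, and the token-keyed aggregation is only one possible alternative. Neither the paper nor the rebuttal describes it.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0.0 or norm_v == 0.0 else dot / (norm_u * norm_v)


def pad(vec, length):
    """Right-pad with zero attributions, truncating if the vector is too long."""
    return (vec + [0.0] * length)[:length]


def by_token(tokens, values):
    """Sum attributions per token string, so token position no longer matters."""
    agg = {}
    for tok, val in zip(tokens, values):
        agg[tok] = agg.get(tok, 0.0) + val
    return agg


def cosine_by_token(a, b):
    """Cosine similarity over the union vocabulary of two token->attribution maps."""
    vocab = sorted(set(a) | set(b))
    return cosine([a.get(t, 0.0) for t in vocab], [b.get(t, 0.0) for t in vocab])


# Two toy sentences; attribution values are invented for illustration.
s1_tokens, s1_vals = ["great", "movie"], [0.9, 0.1]
s2_tokens, s2_vals = ["a", "truly", "great", "movie"], [0.0, 0.1, 0.9, 0.1]

positional = cosine(pad(s1_vals, 8), pad(s2_vals, 8))
token_keyed = cosine_by_token(by_token(s1_tokens, s1_vals),
                              by_token(s2_tokens, s2_vals))
```

Here `positional` comes out near 0.01 and `token_keyed` near 0.99, although both sentences place almost all of their weight on "great". The gap comes entirely from the token's shifted position, which is the kind of length and alignment effect the referee asks the authors to rule out.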

Circularity Check

0 steps flagged

No significant circularity; metric is an explicit definition

The paper defines its central contribution directly as 'the cosine similarity of SHAP values for inputs with the same label' and evaluates it empirically against existing fidelity metrics on BERT/RoBERTa models. No derivation equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The chain consists of applying a standard explainer (SHAP) then computing a standard similarity measure, which is self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields an incomplete ledger. No free parameters are introduced. The central claim rests on the domain assumption that SHAP attributions faithfully reflect feature importance and that cosine similarity is a suitable distance for stability. No invented entities appear.

axioms (1)
  • domain assumption: SHAP values provide faithful local feature attributions for the model's predictions.
    Invoked when the metric is defined directly on SHAP outputs without additional validation in the abstract.

pith-pipeline@v0.9.0 · 5595 in / 1264 out tokens · 52108 ms · 2026-05-10T20:12:55.228874+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

  1. [1]

    Advances in neural information processing systems 35, 15784–15799 (2022)

    Agarwal, C., Krishna, S., Saxena, E., Pawelczyk, M., Johnson, N., Puri, I., Zitnik, M., Lakkaraju, H.: Openxai: Towards a transparent evaluation of model explanations. Advances in neural information processing systems 35, 15784–15799 (2022)

  2. [2]

    Alufaisan, Y., Marusich, L.R., Bakdash, J.Z., Zhou, Y., Kantarcioglu, M.: Does explainable artificial intelligence improve human decision-making? In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 6618–6626 (2021)

  3. [3]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Carrow, S., Erwin, K., Vilenskaia, O., Ram, P., Klinger, T., Khan, N., Makondo, N., Gray, A.G.: Neural reasoning networks: Efficient interpretable neural networks with automatic textual explanations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 15669–15677 (2025)

  4. [4]

    Advances in neural information processing systems 32 (2019)

    Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems 32 (2019)

  5. [5]

    Dervovic, D., Lécué, F., Marchesotti, N., Magazzeni, D.: Are logistic models really interpretable? In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24) (2024)

  6. [6]

    In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

  7. [7]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Fel, T., Vigouroux, D., Cadène, R., Serre, T.: How good is your explanation? algorithmic stability measures to assess the quality of explanations for deep neural networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 720–730 (2022)

  8. [8]

    Advances in neural information processing systems 34, 18395–18407 (2021)

    Ghalebikesabi, S., Ter-Minassian, L., DiazOrdaz, K., Holmes, C.C.: On locality of local explanation models. Advances in neural information processing systems 34, 18395–18407 (2021)

  9. [9]

    Han, B.: Trustworthy machine learning under imperfect data. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. pp. 8535–8540 (2024)

  10. [10]

    In: Proceedings of the AAAI conference on artificial intelligence

    Han, T., Tu, W.W., Li, Y.F.: Explanation consistency training: Facilitating consistency-based semi-supervised learning with interpretability. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 7639–7646 (2021)

  11. [11]

    Advances in neural information processing systems 35, 5256–5268 (2022)

    Han, T., Srinivas, S., Lakkaraju, H.: Which explanation should i choose? a function approximation perspective to characterizing post hoc explanations. Advances in neural information processing systems 35, 5256–5268 (2022)

  12. [12]

    Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., Wang, D.: Seat: stable and explainable attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 12907–12915 (2023)

  13. [13]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Kraus, K., Kroll, M.: Maximizing signal in human-model preference alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 27392–27400 (2025)

  14. [14]

    Advances in neural information processing systems 30 (2017)

    Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. Advances in neural information processing systems 30 (2017)

  15. [15]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Li, T.: Scalable and trustworthy learning in heterogeneous networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 28715–28715 (2025)

  16. [16]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  17. [17]

    Advances in Neural Information Processing Systems

    Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems pp. 4765–4774 (2017)

  18. [18]

    In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies

    Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. pp. 142–150 (2011)

  19. [19]

    In: Proceedings of the AAAI Symposium Series

    Mahmud, S., Saisubramanian, S., Zilberstein, S.: Verification and validation of ai systems using explanations. In: Proceedings of the AAAI Symposium Series. vol. 4, pp. 76–80 (2024)

  20. [20]

    Communications of the ACM 38(11), 39–41 (1995)

    Miller, G.A.: Wordnet: a lexical database for english. Communications of the ACM 38(11), 39–41 (1995)

  21. [21]

    ACM SIGKDD Explorations Newsletter 22(1), 18–33 (2020)

    Moraffah, R., Karami, M., Guo, R., Raglin, A., Liu, H.: Causal interpretability for machine learning-problems, methods and evaluation. ACM SIGKDD Explorations Newsletter 22(1), 18–33 (2020)

  22. [22]

    Advances in Neural Information Processing Systems 34, 26422–26436 (2021)

    Nguyen, G., Kim, D., Nguyen, A.: The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Advances in Neural Information Processing Systems 34, 26422–26436 (2021)

  23. [23]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Pillai, V., Koohpayegani, S.A., Ouligian, A., Fong, D., Pirsiavash, H.: Consistent explanations by contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10213–10222 (2022)

  24. [24]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Pillai, V., Pirsiavash, H.: Explainable models with consistent interpretations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 2431– 2439 (2021)

  25. [25]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Qian, W., Zhao, C., Li, Y., Ma, F., Zhang, C., Huai, M.: Towards modeling uncertainties of self-explaining neural networks via conformal prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14651–14659 (2024)

  26. [26]

    why should i trust you?

    Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should i trust you?" explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. pp. 1135–1144 (2016)

  27. [27]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

  28. [28]

    In: Proceedings of the IEEE international conference on computer vision

    Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)

  29. [29]

    Advances in neural information processing systems 34, 9391–9404 (2021)

    Slack, D., Hilgard, A., Singh, S., Lakkaraju, H.: Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in neural information processing systems 34, 9391–9404 (2021)

  30. [30]

    In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society

    Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling lime and shap: Adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. pp. 180–186 (2020)

  31. [31]

    In: Proceedings of the 2013 conference on empirical methods in natural language processing

    Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing. pp. 1631–1642 (2013)

  32. [32]

    Steck, H., Ekanadham, C., Kallus, N.: Is cosine-similarity of embeddings really about similarity? In: Companion Proceedings of the ACM Web Conference 2024. pp. 887–890 (2024)

  33. [33]

    Advances in Neural Information Processing Systems 34, 12966–12977 (2021)

    Teso, S., Bontempelli, A., Giunchiglia, F., Passerini, A.: Interactive label cleaning with example-based explanations. Advances in Neural Information Processing Systems 34, 12966–12977 (2021)

  34. [34]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Wang, J., Liu, H., Wang, X., Jing, L.: Interpretable image recognition by constructing transparent embedding space. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 895–904 (2021)

  35. [35]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhang, Q., Yang, Y., Ma, H., Wu, Y.N.: Interpreting cnns via decision trees. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6261–6270. IEEE (2019)

  36. [36]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Zheng, X., Shirani, F., Chen, Z., Lin, C., Cheng, W., Guo, W., Luo, D.: F-fidelity: A robust framework for faithfulness evaluation of explainable AI. In: The Thirteenth International Conference on Learning Representations (2025)

  37. [37]

    In: The Twelfth International Conference on Learning Representations (2024)

    Zheng, X., Shirani, F., Wang, T., Cheng, W., Chen, Z., Chen, H., Wei, H., Luo, D.: Towards robust fidelity for evaluating explainability of graph neural networks. In: The Twelfth International Conference on Learning Representations (2024)