Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition
Pith reviewed 2026-05-10 20:12 UTC · model grok-4.3
The pith
A metric based on cosine similarity of SHAP values quantifies whether model explanations stay consistent for inputs sharing the same label.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that rationale stability can be characterized empirically by the average cosine similarity between pairs of SHAP attribution vectors computed on test samples that share the same class label. This provides a way to detect when a model fails to maintain consistent attribution patterns under label-preserving perturbations, thereby revealing deviations from its training objectives.
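The averaging step in this claim can be sketched in a few lines. This is a minimal illustration, not the paper's released code: the function names `cosine_similarity` and `stability_score` are hypothetical, the toy vectors are made up, and the sketch assumes the attribution vectors in each pair already have equal length.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity of two equal-length attribution vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if u.shape != v.shape:
        raise ValueError("attribution vectors must have the same shape")
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Assumption: an all-zero attribution vector counts as zero similarity.
    return float(u @ v / denom) if denom > eps else 0.0

def stability_score(pairs):
    """Average cosine similarity over (attribution_a, attribution_b) pairs."""
    sims = [cosine_similarity(a, b) for a, b in pairs]
    if not sims:
        raise ValueError("need at least one pair")
    return float(np.mean(sims))

# Toy attribution vectors (illustrative values, not results from the paper).
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, similarity 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, similarity 0
]
score = stability_score(toy_pairs)
```

In practice the pairs would hold SHAP attributions for two same-label samples, or for a sample and its label-preserving perturbation; how those vectors are aligned is a separate question the sketch does not address.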
What carries the argument
The argument rests on a single stability metric: the cosine similarity of SHAP feature-importance vectors computed for input samples that share the same label.
If this is right
- Models showing low metric scores can be flagged for potential bias or misalignment even if their accuracy is high.
- The metric can be used alongside traditional fidelity metrics to provide a more complete picture of explanation quality.
- It enables checking consistency under controlled perturbations on datasets like SST-2 and IMDB.
- It supports building more trustworthy pattern recognition systems by quantifying rationale stability.
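The first consequence above, flagging accurate but unstable models, could look something like the following sketch. The function name `flag_models`, both thresholds, and the example numbers are hypothetical and chosen only for illustration; the paper does not specify a flagging rule or cutoff values.

```python
def flag_models(reports, min_accuracy=0.90, min_stability=0.50):
    """Return names of models that are accurate but have low stability.

    `reports` maps a model name to an (accuracy, stability) tuple.
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    return [
        name
        for name, (accuracy, stability) in reports.items()
        if accuracy >= min_accuracy and stability < min_stability
    ]

# Made-up numbers for illustration only; not taken from the paper.
toy_reports = {
    "model_a": (0.93, 0.81),  # accurate and stable
    "model_b": (0.92, 0.34),  # accurate but unstable
    "model_c": (0.71, 0.29),  # inaccurate, so accuracy already flags it
}
flagged = flag_models(toy_reports)
```

Only `model_b` is returned here, which is the case the stability score is meant to surface: accuracy alone would not have raised a concern.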
Where Pith is reading between the lines
- If the metric proves reliable, it could be applied during model development to enforce consistency as a training constraint.
- Low stability might indicate vulnerability to adversarial examples that change explanations without changing labels.
- Future work could test whether this stability correlates with human judgments of explanation quality.
Load-bearing premise
That the cosine similarity of SHAP vectors for same-label samples accurately reflects whether the model maintains consistent reasoning aligned with its training objectives.
What would settle it
Finding a model that achieves correct predictions on same-label inputs but exhibits low cosine similarity in SHAP attributions, or conversely a model with inconsistent predictions but high similarity, would challenge the metric's validity.
Original abstract
Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a novel metric for assessing rationale stability in XAI by computing the cosine similarity of SHAP attribution vectors across same-label inputs (and under label-preserving perturbations). It implements and tests the metric using a pre-trained BERT model on SST-2, with additional experiments on RoBERTa, DistilBERT, and the IMDB dataset, and compares the metric against standard fidelity measures to detect inconsistent model reasoning or biased feature reliance.
Significance. If the metric can be shown to reliably isolate reasoning consistency independent of input length, it would address a genuine gap in instance-centric XAI evaluation by providing a cross-sample consistency check. The public release of the code at the cited GitHub repository is a clear strength that supports reproducibility and further testing.
major comments (2)
- [Abstract and metric definition] The central claim requires that cosine similarity of SHAP vectors reliably signals consistent reasoning for same-label examples. However, SST-2 sentences vary from a few to >30 tokens, so the resulting SHAP vectors have unequal dimensionality. No description is given (in the abstract or the experimental setup) of the alignment, padding, truncation, or aggregation step that renders the vectors commensurate before cosine similarity is computed. If the implementation simply pads with zeros or truncates, the metric will be dominated by length differences and padding tokens rather than semantic feature correspondence, breaking the claimed link to 'consistent reasoning aligned with intended objectives.'
- [Abstract / Experimental evaluation] The abstract states that the metric is evaluated 'through a series of experiments' that 'identify misaligned predictions' and are 'compared against standard fidelity metrics,' yet no quantitative results, tables, error analysis, or ablation on the perturbation generation process are referenced. Without these, it is impossible to verify whether the new metric actually separates from existing fidelity baselines or whether the reported inconsistencies are driven by the length-sensitivity issue above.
minor comments (1)
- [Abstract] The title emphasizes 'Controlled Perturbations' but the abstract provides no concrete description of how label-preserving perturbations are generated or applied; this detail should be added for clarity.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will make corresponding revisions to improve clarity and address the concerns raised.
Point-by-point responses
- Referee: [Abstract and metric definition] The central claim requires that cosine similarity of SHAP vectors reliably signals consistent reasoning for same-label examples. However, SST-2 sentences vary from a few to >30 tokens, so the resulting SHAP vectors have unequal dimensionality. No description is given (in the abstract or the experimental setup) of the alignment, padding, truncation, or aggregation step that renders the vectors commensurate before cosine similarity is computed. If the implementation simply pads with zeros or truncates, the metric will be dominated by length differences and padding tokens rather than semantic feature correspondence, breaking the claimed link to 'consistent reasoning aligned with intended objectives.'
  Authors: We agree that the manuscript does not provide an explicit description of how variable-length SHAP attribution vectors are made commensurate for cosine similarity. This omission leaves open the possibility that length differences or padding artifacts could influence the metric. In the current implementation, shorter sequences are padded with zero-valued attributions to a fixed maximum length (the model's sequence limit), but no masking or normalization was applied to isolate padding effects. We will revise the experimental setup section to include a precise description of the vector preparation procedure and will add an ablation study examining the metric's sensitivity to different padding and truncation strategies. This will allow readers to assess whether the reported consistency reflects semantic reasoning or length bias.
  Revision: yes
- Referee: [Abstract / Experimental evaluation] The abstract states that the metric is evaluated 'through a series of experiments' that 'identify misaligned predictions' and are 'compared against standard fidelity metrics,' yet no quantitative results, tables, error analysis, or ablation on the perturbation generation process are referenced. Without these, it is impossible to verify whether the new metric actually separates from existing fidelity baselines or whether the reported inconsistencies are driven by the length-sensitivity issue above.
  Authors: The abstract is written at a high level and does not include specific numerical results or table references, which is standard for the format. The body of the manuscript does contain quantitative comparisons, tables of similarity scores versus fidelity metrics, and analysis of detected inconsistencies. However, we acknowledge that the abstract could better signal the strength of these findings and that an explicit ablation on perturbation generation would help rule out confounds such as length sensitivity. We will revise the abstract to include concise references to key quantitative outcomes and will add a dedicated ablation subsection on the perturbation process in the revised manuscript.
  Revision: yes
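The length-sensitivity concern from the first exchange can be made concrete with a small sketch. It assumes right-padding with zeros to a fixed length, as the authors describe; the helper names `pad_to` and `cosine`, the value of `max_len`, and the toy attribution values are hypothetical. Truncation to the shared length appears only as a contrast, not as a recommended fix, since it discards attributions.

```python
import numpy as np

def pad_to(vec, length):
    """Right-pad an attribution vector with zeros up to `length`."""
    vec = np.asarray(vec, dtype=float)
    if vec.size > length:
        raise ValueError("vector is longer than the target length")
    return np.pad(vec, (0, length - vec.size))

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

max_len = 8  # hypothetical stand-in for the model's sequence limit

# Two toy inputs with identical per-token attributions but different lengths.
shorter = [0.5, 0.5]
longer = [0.5, 0.5, 0.5, 0.5]

# Zero-padding: the extra tokens of the longer input add to its norm
# but not to the dot product, so similarity drops below 1.
sim_padded = cosine(pad_to(shorter, max_len), pad_to(longer, max_len))

# Truncating both to the shared length hides the difference entirely.
shared = min(len(shorter), len(longer))
sim_truncated = cosine(np.asarray(shorter[:shared]), np.asarray(longer[:shared]))
```

With these toy values the padded similarity is about 0.707 while the truncated one is 1, even though every token carries the same attribution. The gap comes purely from length, which is the confound the proposed ablation would need to rule out.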
Circularity Check
No significant circularity; metric is an explicit definition
full rationale
The paper defines its central contribution directly as 'the cosine similarity of SHAP values for inputs with the same label' and evaluates it empirically against existing fidelity metrics on BERT/RoBERTa models. No derivation equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The chain consists of applying a standard explainer (SHAP) then computing a standard similarity measure, which is self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SHAP values provide faithful local feature attributions for the model's predictions
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label... $\mathrm{ESS} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{sim}_{\cos}\big(S(x^{(i)}_1),\, S(x^{(i)}_2)\big)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.