Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition

Abu Noman Md Sakib; Merjulah Roby; Zhensen Wang; Zijie Zhang

arXiv: 2604.04456 · v1 · submitted 2026-04-06 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-10 20:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords SHAP · rationale stability · explainable AI · cosine similarity · BERT · sentiment analysis · model consistency · XAI evaluation

The pith

A metric based on cosine similarity of SHAP values quantifies whether model explanations stay consistent for inputs sharing the same label.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new way to check if AI models use stable reasoning by comparing their explanation patterns across multiple examples that have the same label. Standard evaluations look at one prediction at a time, but this metric uses SHAP attributions and cosine similarity to see if the model focuses on the same features for similar cases. The authors apply it to BERT models on sentiment datasets and test it against usual fidelity checks to find when models might be relying on inconsistent or biased features despite correct predictions. If successful, this gives a practical tool for verifying that explanations reflect the model's intended behavior across variations.

Core claim

The central claim is that rationale stability can be characterized empirically by the average cosine similarity between pairs of SHAP attribution vectors computed on test samples that share the same class label. This provides a way to detect when a model fails to maintain consistent attribution patterns under label-preserving perturbations, thereby revealing deviations from its training objectives.
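
As a rough illustration of the core claim, the average over same-label pairs could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function names `cosine` and `mean_pairwise_stability` are hypothetical, the attribution vectors are made-up toy values rather than SHAP outputs, and the vectors are assumed to have already been brought to a common length.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_stability(attributions, labels):
    """Average cosine similarity over all pairs of attribution vectors per label.

    attributions: list of equal-length lists of floats (assumed pre-aligned).
    labels: one class label per attribution vector.
    Returns a dict mapping each label to its mean pairwise similarity.
    """
    by_label = {}
    for vec, lab in zip(attributions, labels):
        by_label.setdefault(lab, []).append(vec)

    scores = {}
    for lab, vecs in by_label.items():
        pairs = list(combinations(vecs, 2))
        if pairs:  # a label with a single sample has no pairs to compare
            scores[lab] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return scores


# Toy, invented attribution vectors (not results from the paper).
attrs = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labs = ["pos", "pos", "neg", "neg"]
scores = mean_pairwise_stability(attrs, labs)
```

In this toy input the two "pos" vectors point in the same direction and the two "neg" vectors are orthogonal, so a low per-label score flags the class whose attributions disagree.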

What carries the argument

The stability metric defined as the cosine similarity of SHAP feature importance vectors for same-label input samples.

If this is right

Models showing low metric scores can be flagged for potential bias or misalignment even if their accuracy is high.
The metric can be used alongside traditional fidelity metrics to provide a more complete picture of explanation quality.
It enables checking consistency under controlled perturbations on datasets like SST-2 and IMDB.
It supports building more trustworthy pattern recognition systems by quantifying rationale stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the metric proves reliable, it could be applied during model development to enforce consistency as a training constraint.
Low stability might indicate vulnerability to adversarial examples that change explanations without changing labels.
Future work could test whether this stability correlates with human judgments of explanation quality.

Load-bearing premise

That the cosine similarity of SHAP vectors for same-label samples accurately reflects whether the model maintains consistent reasoning aligned with its training objectives.

What would settle it

Finding a model that achieves correct predictions on same-label inputs but exhibits low cosine similarity in SHAP attributions, or conversely a model with inconsistent predictions but high similarity, would challenge the metric's validity.

Figures

Figures reproduced from arXiv: 2604.04456 by Abu Noman Md Sakib, Merjulah Roby, Zhensen Wang, Zijie Zhang.

**Figure 1.** Overall methodology for rationale-stability evaluation. Samples from the data are encoded and processed by transformer classifiers, post-hoc token attributions are computed, and the proposed ESS Score is obtained.

Adjacent source text: 3.2 Model Setup and Preprocessing. For our experiments, we employ several pre-trained models and well-established datasets. The models we use are BERT (bert-base-uncased) [6], RoBERTa (robertabas…

**Figure 2.** Feature Importance for BERT on SST-2 (Test Split).

Adjacent source text: … shift after paraphrasing indicates reduced attribution stability even when the predicted label is preserved. 4.3 Analysis. The ESS effectively identifies explanation inconsistencies, which are essential for ensuring model reliability and alignment with human values. Lower ESS values for positive sentiment (e.g., paraphrased test: 0.0945) suggest unreliable …

**Figure 3.** Mean ESS and Fidelity across models.

Adjacent source text: … significance value of p = 0.021 validates that paraphrasing has a statistically significant effect on the explanation stability. This finding supports the use of paraphrasing as a tool for detecting misalignment in the model’s reasoning. One key insight is that human interpretable explanations require a certain degree of consistency. In cognitive science, it is widely a…

Original abstract

Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cosine similarity on unaligned SHAP vectors for variable-length same-label inputs is the main proposal here, and it likely needs a normalization fix before it reliably measures rationale consistency.

This paper's core idea is a new metric that measures cosine similarity of SHAP attributions across different inputs with the same label to check for stable rationales under perturbations. They apply it to a BERT model on SST-2 and run similar tests on RoBERTa, DistilBERT, and the IMDB dataset. What stands out is the public code and the multi-model setup, which lets others reproduce the consistency checks. They also position it against fidelity metrics to see if it catches extra inconsistencies like biased feature reliance.

The soft spot is the handling of variable-length inputs. Since sentences in these datasets have different token counts, the SHAP vectors differ in dimension. The abstract mentions label-preserving perturbations and controlled changes but says nothing about aligning or normalizing the vectors before taking cosine similarity. Without that step, the metric is probably picking up length differences more than reasoning consistency, which undercuts the main claim. The experiments are only sketched at a high level, with no specific numbers or error analysis provided in the summary, so it's unclear how well it separates good from bad cases in practice.

This is the kind of paper that might interest people working on practical XAI evaluation for NLP models. It adds a group-level view that instance-based methods miss, but the length issue needs to be fixed for the metric to be reliable. I think it deserves peer review to get the implementation details and results checked properly.
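
The alignment concern can be made concrete with a toy comparison. The sketch below contrasts position-wise zero padding with one possible alternative: summing attributions per token and comparing over a shared vocabulary. That alternative is an illustrative assumption, not something the paper or the letter proposes, and the helper names (`pad_to`, `token_keyed`) and attribution values are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pad_to(values, length):
    """Right-pad a list of attributions with zeros (the position-wise scheme)."""
    return list(values) + [0.0] * (length - len(values))


def token_keyed(tokens, values, vocab):
    """Sum attributions per token and lay them out in a shared vocabulary order."""
    totals = {}
    for tok, val in zip(tokens, values):
        totals[tok] = totals.get(tok, 0.0) + val
    return [totals.get(tok, 0.0) for tok in vocab]


# Invented attributions for two same-label sentences; "great" dominates in both.
tokens_a, values_a = ["great", "film"], [0.9, 0.1]
tokens_b, values_b = ["a", "truly", "great", "film"], [0.0, 0.1, 0.9, 0.1]

# Position-wise comparison after zero padding to the longer length.
max_len = max(len(values_a), len(values_b))
positional = cosine(pad_to(values_a, max_len), pad_to(values_b, max_len))

# Token-keyed comparison over the union of tokens seen in either sentence.
vocab = sorted(set(tokens_a) | set(tokens_b))
aligned = cosine(
    token_keyed(tokens_a, values_a, vocab),
    token_keyed(tokens_b, values_b, vocab),
)
```

In this toy case both sentences lean on the same word, yet the position-wise score is close to 0 because "great" sits at different indices, while the token-keyed score is close to 1.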

Referee Report

2 major / 1 minor

Summary. The paper proposes a novel metric for assessing rationale stability in XAI by computing the cosine similarity of SHAP attribution vectors across same-label inputs (and under label-preserving perturbations). It implements and tests the metric using a pre-trained BERT model on SST-2, with additional experiments on RoBERTa, DistilBERT, and the IMDB dataset, and compares the metric against standard fidelity measures to detect inconsistent model reasoning or biased feature reliance.

Significance. If the metric can be shown to reliably isolate reasoning consistency independent of input length, it would address a genuine gap in instance-centric XAI evaluation by providing a cross-sample consistency check. The public release of the code at the cited GitHub repository is a clear strength that supports reproducibility and further testing.

major comments (2)

[Abstract and metric definition] The central claim requires that cosine similarity of SHAP vectors reliably signals consistent reasoning for same-label examples. However, SST-2 sentences vary from a few to >30 tokens, so the resulting SHAP vectors have unequal dimensionality. No description is given (in the abstract or the experimental setup) of the alignment, padding, truncation, or aggregation step that renders the vectors commensurate before cosine similarity is computed. If the implementation simply pads with zeros or truncates, the metric will be dominated by length differences and padding tokens rather than semantic feature correspondence, breaking the claimed link to 'consistent reasoning aligned with intended objectives.'
[Abstract / Experimental evaluation] The abstract states that the metric is evaluated 'through a series of experiments' that 'identify misaligned predictions' and are 'compared against standard fidelity metrics,' yet no quantitative results, tables, error analysis, or ablation on the perturbation generation process are referenced. Without these, it is impossible to verify whether the new metric actually separates from existing fidelity baselines or whether the reported inconsistencies are driven by the length-sensitivity issue above.

minor comments (1)

[Abstract] The title emphasizes 'Controlled Perturbations' but the abstract provides no concrete description of how label-preserving perturbations are generated or applied; this detail should be added for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will make corresponding revisions to improve clarity and address the concerns raised.

Referee: [Abstract and metric definition] The central claim requires that cosine similarity of SHAP vectors reliably signals consistent reasoning for same-label examples. However, SST-2 sentences vary from a few to >30 tokens, so the resulting SHAP vectors have unequal dimensionality. No description is given (in the abstract or the experimental setup) of the alignment, padding, truncation, or aggregation step that renders the vectors commensurate before cosine similarity is computed. If the implementation simply pads with zeros or truncates, the metric will be dominated by length differences and padding tokens rather than semantic feature correspondence, breaking the claimed link to 'consistent reasoning aligned with intended objectives.'

Authors: We agree that the manuscript does not provide an explicit description of how variable-length SHAP attribution vectors are made commensurate for cosine similarity. This omission leaves open the possibility that length differences or padding artifacts could influence the metric. In the current implementation, shorter sequences are padded with zero-valued attributions to a fixed maximum length (the model's sequence limit), but no masking or normalization was applied to isolate padding effects. We will revise the experimental setup section to include a precise description of the vector preparation procedure and will add an ablation study examining the metric's sensitivity to different padding and truncation strategies. This will allow readers to assess whether the reported consistency reflects semantic reasoning or length bias. revision: yes
Referee: [Abstract / Experimental evaluation] The abstract states that the metric is evaluated 'through a series of experiments' that 'identify misaligned predictions' and are 'compared against standard fidelity metrics,' yet no quantitative results, tables, error analysis, or ablation on the perturbation generation process are referenced. Without these, it is impossible to verify whether the new metric actually separates from existing fidelity baselines or whether the reported inconsistencies are driven by the length-sensitivity issue above.

Authors: The abstract is written at a high level and does not include specific numerical results or table references, which is standard for the format. The body of the manuscript does contain quantitative comparisons, tables of similarity scores versus fidelity metrics, and analysis of detected inconsistencies. However, we acknowledge that the abstract could better signal the strength of these findings and that an explicit ablation on perturbation generation would help rule out confounds such as length sensitivity. We will revise the abstract to include concise references to key quantitative outcomes and will add a dedicated ablation subsection on the perturbation process in the revised manuscript. revision: yes
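
The padding-versus-truncation ablation promised in the first response might start from a comparison like the one below. It is a sketch under assumptions: the function names are hypothetical, the attribution values are invented, and whether the released code behaves this way is not confirmed here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_zero_padded(u, v):
    """Pad the shorter vector with zeros, then compare position by position."""
    length = max(len(u), len(v))
    u_pad = list(u) + [0.0] * (length - len(u))
    v_pad = list(v) + [0.0] * (length - len(v))
    return cosine(u_pad, v_pad)


def cosine_truncated(u, v):
    """Cut both vectors to the shorter length before comparing."""
    length = min(len(u), len(v))
    return cosine(u[:length], v[:length])


# Invented example: every token carries the same attribution in both inputs,
# and only the sequence length differs.
short_attr = [0.5, 0.5]
long_attr = [0.5, 0.5, 0.5, 0.5]

padded_score = cosine_zero_padded(short_attr, long_attr)
truncated_score = cosine_truncated(short_attr, long_attr)
```

Here the padded score is about 0.71 and the truncated score is 1, even though the per-token attributions are identical, so the gap comes purely from sequence length. Reporting both variants side by side is one way an ablation could expose that sensitivity.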

Circularity Check

0 steps flagged

No significant circularity; metric is an explicit definition

full rationale

The paper defines its central contribution directly as 'the cosine similarity of SHAP values for inputs with the same label' and evaluates it empirically against existing fidelity metrics on BERT/RoBERTa models. No derivation equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The chain consists of applying a standard explainer (SHAP) then computing a standard similarity measure, which is self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields an incomplete ledger. No free parameters are introduced. The central claim rests on the domain assumption that SHAP attributions faithfully reflect feature importance and that cosine similarity is a suitable distance for stability. No invented entities appear.

axioms (1)

Domain assumption: SHAP values provide faithful local feature attributions for the model's predictions.
Invoked when the metric is defined directly on SHAP outputs without additional validation in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label... $\mathrm{ESS} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{sim}_{\cos}\big(S(x^{(i)}_1), S(x^{(i)}_2)\big)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1] Agarwal, C., Krishna, S., Saxena, E., Pawelczyk, M., Johnson, N., Puri, I., Zitnik, M., Lakkaraju, H.: OpenXAI: Towards a transparent evaluation of model explanations. Advances in Neural Information Processing Systems 35, 15784–15799 (2022)
[2] Alufaisan, Y., Marusich, L.R., Bakdash, J.Z., Zhou, Y., Kantarcioglu, M.: Does explainable artificial intelligence improve human decision-making? In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 6618–6626 (2021)
[3] Carrow, S., Erwin, K., Vilenskaia, O., Ram, P., Klinger, T., Khan, N., Makondo, N., Gray, A.G.: Neural reasoning networks: Efficient interpretable neural networks with automatic textual explanations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 15669–15677 (2025)
[4] Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems 32 (2019)
[5] Dervovic, D., Lécué, F., Marchesotti, N., Magazzeni, D.: Are logistic models really interpretable? In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24) (2024)
[6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
[7] Fel, T., Vigouroux, D., Cadène, R., Serre, T.: How good is your explanation? Algorithmic stability measures to assess the quality of explanations for deep neural networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 720–730 (2022)
[8] Ghalebikesabi, S., Ter-Minassian, L., DiazOrdaz, K., Holmes, C.C.: On locality of local explanation models. Advances in Neural Information Processing Systems 34, 18395–18407 (2021)
[9] Han, B.: Trustworthy machine learning under imperfect data. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. pp. 8535–8540 (2024)
[10] Han, T., Tu, W.W., Li, Y.F.: Explanation consistency training: Facilitating consistency-based semi-supervised learning with interpretability. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 7639–7646 (2021)
[11] Han, T., Srinivas, S., Lakkaraju, H.: Which explanation should I choose? A function approximation perspective to characterizing post hoc explanations. Advances in Neural Information Processing Systems 35, 5256–5268 (2022)
[12] Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., Wang, D.: SEAT: stable and explainable attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 12907–12915 (2023)
[13] Kraus, K., Kroll, M.: Maximizing signal in human-model preference alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 27392–27400 (2025)
[14] Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. Advances in Neural Information Processing Systems 30 (2017)
[15] Li, T.: Scalable and trustworthy learning in heterogeneous networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 28715–28715 (2025)
[16] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
[17] Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)
[18] Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 142–150 (2011)
[19] Mahmud, S., Saisubramanian, S., Zilberstein, S.: Verification and validation of AI systems using explanations. In: Proceedings of the AAAI Symposium Series. vol. 4, pp. 76–80 (2024)
[20] Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
[21] Moraffah, R., Karami, M., Guo, R., Raglin, A., Liu, H.: Causal interpretability for machine learning-problems, methods and evaluation. ACM SIGKDD Explorations Newsletter 22(1), 18–33 (2020)
[22] Nguyen, G., Kim, D., Nguyen, A.: The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Advances in Neural Information Processing Systems 34, 26422–26436 (2021)
[23] Pillai, V., Koohpayegani, S.A., Ouligian, A., Fong, D., Pirsiavash, H.: Consistent explanations by contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10213–10222 (2022)
[24] Pillai, V., Pirsiavash, H.: Explainable models with consistent interpretations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 2431–2439 (2021)
[25] Qian, W., Zhao, C., Li, Y., Ma, F., Zhang, C., Huai, M.: Towards modeling uncertainties of self-explaining neural networks via conformal prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14651–14659 (2024)
[26] Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144 (2016)
[27] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
[28] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 618–626 (2017)
[29] Slack, D., Hilgard, A., Singh, S., Lakkaraju, H.: Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in Neural Information Processing Systems 34, 9391–9404 (2021)
[30] Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. pp. 180–186 (2020)
[31] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 1631–1642 (2013)
[32] Steck, H., Ekanadham, C., Kallus, N.: Is cosine-similarity of embeddings really about similarity? In: Companion Proceedings of the ACM Web Conference 2024. pp. 887–890 (2024)
[33] Teso, S., Bontempelli, A., Giunchiglia, F., Passerini, A.: Interactive label cleaning with example-based explanations. Advances in Neural Information Processing Systems 34, 12966–12977 (2021)
[34] Wang, J., Liu, H., Wang, X., Jing, L.: Interpretable image recognition by constructing transparent embedding space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 895–904 (2021)
[35] Zhang, Q., Yang, Y., Ma, H., Wu, Y.N.: Interpreting CNNs via decision trees. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6261–6270. IEEE (2019)
[36] Zheng, X., Shirani, F., Chen, Z., Lin, C., Cheng, W., Guo, W., Luo, D.: F-fidelity: A robust framework for faithfulness evaluation of explainable AI. In: The Thirteenth International Conference on Learning Representations (2025)
[37] Zheng, X., Shirani, F., Wang, T., Cheng, W., Chen, Z., Chen, H., Wei, H., Luo, D.: Towards robust fidelity for evaluating explainability of graph neural networks. In: The Twelfth International Conference on Learning Representations (2024)

[1] [1]

Advances in neural information processing systems35, 15784–15799 (2022)

Agarwal, C., Krishna, S., Saxena, E., Pawelczyk, M., Johnson, N., Puri, I., Zitnik, M., Lakkaraju, H.: Openxai: Towards a transparent evaluation of model explana- tions. Advances in neural information processing systems35, 15784–15799 (2022)

work page 2022

[2] [2]

Alufaisan, Y., Marusich, L.R., Bakdash, J.Z., Zhou, Y., Kantarcioglu, M.: Does explainable artificial intelligence improve human decision-making? In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 6618–6626 (2021)

work page 2021

[3] [3]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Carrow, S., Erwin, K., Vilenskaia, O., Ram, P., Klinger, T., Khan, N., Makondo, N., Gray, A.G.: Neural reasoning networks: Efficient interpretable neural networks with automatic textual explanations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 15669–15677 (2025)

work page 2025

[4] [4]

Advances in neural information processing systems32(2019)

Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems32(2019)

work page 2019

[5] [5]

Sakib et al

Dervovic, D., Lécué, F., Marchesotti, N., Magazzeni, D.: Are logistic models really interpretable? In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24) (2024) 14 A.N.M. Sakib et al

work page 2024

[6] [6]

In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

work page 2019

[7] [7]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Fel, T., Vigouroux, D., Cadène, R., Serre, T.: How good is your explanation? algorithmic stability measures to assess the quality of explanations for deep neural networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 720–730 (2022)

work page 2022

[8] [8]

Advances in neural information processing systems34, 18395–18407 (2021)

Ghalebikesabi, S., Ter-Minassian, L., DiazOrdaz, K., Holmes, C.C.: On locality of local explanation models. Advances in neural information processing systems34, 18395–18407 (2021)

work page 2021

[9] [9]

Han,B.:Trustworthymachinelearningunderimperfectdata.In:Proceedingsofthe Thirty-Third International Joint Conference on Artificial Intelligence. pp. 8535– 8540 (2024)

work page 2024

[10] [10]

In: Proceedings of the AAAI conference on artificial intelligence

Han, T., Tu, W.W., Li, Y.F.: Explanation consistency training: Facilitating consistency-based semi-supervised learning with interpretability. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 7639–7646 (2021)

work page 2021

[11] [11]

Advances in neural information processing systems35, 5256–5268 (2022)

Han, T., Srinivas, S., Lakkaraju, H.: Which explanation should i choose? a function approximation perspective to characterizing post hoc explanations. Advances in neural information processing systems35, 5256–5268 (2022)

work page 2022

[12] Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., Wang, D.: SEAT: stable and explainable attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 12907–12915 (2023)

[13] Kraus, K., Kroll, M.: Maximizing signal in human-model preference alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 27392–27400 (2025)

[14] Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. Advances in neural information processing systems 30 (2017)

[15] Li, T.: Scalable and trustworthy learning in heterogeneous networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 28715–28715 (2025)

[16] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

[17] Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)

[18] Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. pp. 142–150 (2011)

[19] Mahmud, S., Saisubramanian, S., Zilberstein, S.: Verification and validation of AI systems using explanations. In: Proceedings of the AAAI Symposium Series. vol. 4, pp. 76–80 (2024)

[20] Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)

[21] Moraffah, R., Karami, M., Guo, R., Raglin, A., Liu, H.: Causal interpretability for machine learning - problems, methods and evaluation. ACM SIGKDD Explorations Newsletter 22(1), 18–33 (2020)

[22] Nguyen, G., Kim, D., Nguyen, A.: The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Advances in Neural Information Processing Systems 34, 26422–26436 (2021)

[23] Pillai, V., Koohpayegani, S.A., Ouligian, A., Fong, D., Pirsiavash, H.: Consistent explanations by contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10213–10222 (2022)

[24] Pillai, V., Pirsiavash, H.: Explainable models with consistent interpretations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 2431–2439 (2021)

[25] Qian, W., Zhao, C., Li, Y., Ma, F., Zhang, C., Huai, M.: Towards modeling uncertainties of self-explaining neural networks via conformal prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14651–14659 (2024)

[26] Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. pp. 1135–1144 (2016)

[27] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

[28] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)

[29] Slack, D., Hilgard, A., Singh, S., Lakkaraju, H.: Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in neural information processing systems 34, 9391–9404 (2021)

[30] Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. pp. 180–186 (2020)

[31] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing. pp. 1631–1642 (2013)

[32] Steck, H., Ekanadham, C., Kallus, N.: Is cosine-similarity of embeddings really about similarity? In: Companion Proceedings of the ACM Web Conference 2024. pp. 887–890 (2024)

[33] Teso, S., Bontempelli, A., Giunchiglia, F., Passerini, A.: Interactive label cleaning with example-based explanations. Advances in Neural Information Processing Systems 34, 12966–12977 (2021)

[34] Wang, J., Liu, H., Wang, X., Jing, L.: Interpretable image recognition by constructing transparent embedding space. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 895–904 (2021)

[35] Zhang, Q., Yang, Y., Ma, H., Wu, Y.N.: Interpreting CNNs via decision trees. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6261–6270. IEEE (2019)

[36] Zheng, X., Shirani, F., Chen, Z., Lin, C., Cheng, W., Guo, W., Luo, D.: F-fidelity: A robust framework for faithfulness evaluation of explainable AI. In: The Thirteenth International Conference on Learning Representations (2025)

[37] Zheng, X., Shirani, F., Wang, T., Cheng, W., Chen, Z., Chen, H., Wei, H., Luo, D.: Towards robust fidelity for evaluating explainability of graph neural networks. In: The Twelfth International Conference on Learning Representations (2024)