
arxiv: 2602.16608 · v2 · pith:NZVMIOVLnew · submitted 2026-02-18 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Pith reviewed 2026-05-21 12:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG
keywords Explainable AI · Transformer models · Integrated Gradients · Attention gradients · Layer-wise attribution · Context-aware explanations · Interpretability

The pith

A new framework fuses layer-wise Integrated Gradients with class-specific attention gradients to produce more faithful, context-sensitive explanations for Transformer predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework to explain how Transformer models reach decisions. Existing methods either stay at the final layer or treat tokens and attention separately, missing how relevance moves through layers and depends on surrounding tokens. CA-LIG computes Integrated Gradients inside each block and combines them with attention gradients, yielding signed maps that show both supporting and opposing evidence for a prediction. Evaluations on sentiment analysis with BERT, hate-speech detection with XLM-R and AfroLM, and image classification with a vision Transformer show stronger sensitivity to context and clearer visualizations than prior techniques. If the fusion works as described, users gain a unified way to trace decision-making across the entire model hierarchy.

Core claim

The Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the layers.

What carries the argument

The CA-LIG Framework, which integrates layer-wise Integrated Gradients computed inside each Transformer block with class-specific attention gradients to generate context-aware attribution maps.
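As a point of reference for the Integrated Gradients building block, the sketch below approximates standard Integrated Gradients with a midpoint Riemann sum. It is not the paper's layer-wise procedure: the toy logistic model, its weights, the all-zero baseline, and the function names are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    # Toy stand-in for a class score: logistic regression (assumption).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    # Analytic gradient of the toy model with respect to its input.
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    # Average the gradient along the straight path baseline -> x
    # (midpoint rule), then scale by the input difference.
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w, b))]
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

w = [2.0, -1.5, 0.5]           # hypothetical weights
b = 0.1
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]     # hypothetical all-zero baseline

attr = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)
print(attr, sum(attr), gap)
```

The attributions are signed: the feature with a positive weight receives positive relevance and the one with a negative weight receives negative relevance, and their sum closely matches the change in model output between baseline and input.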

If this is right

  • Explanations become traceable across every layer rather than only the output layer.
  • Attributions distinguish tokens that support a class from those that oppose it in the same map.
  • The same method applies without modification to BERT, XLM-R, AfroLM, and vision Transformers.
  • Visualizations highlight inter-token dependencies that single-layer methods overlook.
  • Performance holds across sentiment, document classification, and image tasks in multiple languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on generative language models to check whether layer-wise fusion still isolates relevant context in long sequences.
  • If the maps prove stable under small input changes, they might serve as a diagnostic for detecting when a model relies on spurious correlations.
  • Extending the fusion to include gradient information from feed-forward sublayers might further refine the attribution of structural components.
  • Practitioners could use the resulting maps to prioritize which training examples to inspect when auditing model fairness.

Load-bearing premise

Combining layer-wise Integrated Gradients with attention gradients accurately reflects how relevance actually flows through the model without adding bias or artifacts to the maps.

What would settle it

A direct comparison on a held-out test set where CA-LIG attributions show lower correlation with human-annotated important tokens or weaker performance on insertion-deletion perturbation tests than standard Integrated Gradients or attention rollout.
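A deletion-style perturbation test of the kind mentioned above can be sketched as follows. The linear toy scorer, the zero replacement value, and the two hand-written attribution vectors are assumptions for illustration; they are not taken from the paper's evaluation.

```python
def deletion_curve(x, scores, predict, baseline=0.0):
    # Remove features in order of decreasing attribution and record
    # the model output after each removal.
    order = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)
    current = list(x)
    curve = [predict(current)]
    for i in order:
        current[i] = baseline
        curve.append(predict(current))
    return curve

def auc(curve):
    # Trapezoidal area under the curve, with the x-axis scaled to [0, 1].
    n = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2.0 for k in range(n)) / n

weights = [3.0, 1.0, 0.5, 0.2]   # hypothetical toy model

def predict(v):
    return sum(w * vi for w, vi in zip(weights, v))

x = [1.0, 1.0, 1.0, 1.0]
good_scores = [3.0, 1.0, 0.5, 0.2]   # ranks features by true importance
bad_scores = [0.2, 0.5, 1.0, 3.0]    # ranks them in reverse

good_auc = auc(deletion_curve(x, good_scores, predict))
bad_auc = auc(deletion_curve(x, bad_scores, predict))
print(good_auc, bad_auc)
```

Under this convention a lower deletion AUC indicates a more faithful ranking, because removing the truly important features first makes the score collapse sooner.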

Figures

Figures reproduced from arXiv: 2602.16608 by Jugal Kalita, Melkamu Abay Mersha.

Figure 1
Figure 1: Proposed architecture of the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework. For each Transformer block b, we perform an element-wise combination of the attention gradients ∇A^(b) ∈ R^(h×s×s) with a normalized form of token-level relevance Norm(R^(l)) ∈ R^s, where R^(l) is the relevance vector from layer l, aligned to the attention map’s sequence dimension. We apply the Symmetric M…
Figure 2
Figure 2: CA-LIG token-level attributions for a document labeled Christian class from the 20 Newsgroups dataset using BERT-large. Brighter green tokens provide stronger positive evidence, lighter green indicates weaker support, red shows negative influence, and white denotes neutral relevance.
Figure 3
Figure 3: CA-LIG token-level attributions for a document labeled atheist class from the 20 Newsgroups dataset using BERT-base. Brighter green tokens provide stronger positive evidence, lighter green indicates weaker support, red shows negative influence, and white denotes neutral relevance.
Figure 4
Figure 4: CA-LIG token-level attributions for a negative IMDB review using BERT-Large. Brighter red indicates stronger negative evidence, green indicates positive relevance, and white denotes neutral tokens.
… layer-wise attribution case analyses that highlight the effectiveness of our proposed framework across various Transformer models and tasks. All experiments are conducted with λ = 1, which provides a balanced f…
Figure 5
Figure 5: CA-LIG token-level attributions for an Amharic hate speech sample using the XLM-R model. Brighter red indicates stronger negative evidence, green indicates positive relevance, and white denotes neutral tokens.
Figure 9
Figure 9: Qualitative comparison of explanations generated by baseline XAI methods and CA-LIG using a MAE model. Warmer colors denote regions with higher positive relevance, while cooler colors indicate lower relevance. Images are taken from the ASIRRA dataset [57]. (a) Original Input (b) CA-LIG Explanation (c) Positive Attribution (helps prediction) (d) Negative Attribution (hinders prediction)
Figure 10
Figure 10: Example of an explanation generated using CA-LIG for a prediction made by MAE model. (a) Original input image, (b) CA-LIG explanation heatmap, (c) positively attributed regions, and (d) negatively attributed regions. Warmer colors indicate stronger relevance.
… hate speech sample using XLM-R and AfroLM models, respectively. CA-LIG assigns strong negative relevance to explicitly abusive tokens (…
Figure 11
Figure 11: Qualitative comparison of explanations produced by baseline XAI methods and CA-LIG using a BERT-base model. Brighter green indicates stronger positive relevance, red indicates negative relevance, and white represents neutral tokens.
… all methods improve as more tokens are included, our CA-LIG approach consistently achieves higher token-F1 than the baselines. In the vision task, we assess the faithfulnes…
Original abstract

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and Transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework to explain Transformer models. It computes layer-wise Integrated Gradients within each Transformer block and fuses these with class-specific attention gradients to generate signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing hierarchical relevance flow. Evaluations are reported on sentiment analysis and document classification with BERT, hate speech detection with XLM-R and AfroLM, and image classification with a Masked Autoencoder ViT, with claims of superior faithfulness, contextual sensitivity, and visualization clarity over existing methods.

Significance. If the results hold after verification, this offers a unified hierarchical attribution approach that integrates local token-level IG with global attention patterns across layers, addressing gaps in final-layer-only explainability methods for Transformers in NLP and vision tasks.

major comments (2)
  1. [Methods (CA-LIG Framework)] Methods section (CA-LIG Framework description): The fusion of layer-wise Integrated Gradients with class-specific attention gradients is presented as producing unbiased, context-sensitive maps that trace hierarchical relevance, but no explicit normalization, scaling, or sign-consistency procedure between the IG and attention components is described. Attention gradients are typically sparse and uncalibrated to output sensitivity; without per-component normalization this risks scale or sign artifacts that could dominate or cancel IG contributions, directly undermining the central claim that the method captures inter-token dependencies without systematic bias.
  2. [Evaluation] Evaluation section: The manuscript claims consistent improvements in faithfulness and sensitivity across tasks and architectures, yet the provided details lack specific quantitative metrics (e.g., faithfulness scores, AUC for sensitivity), explicit baseline comparisons (standard IG, attention rollout, or Grad-CAM), and statistical tests. This makes it difficult to assess whether the reported superiority is robust or could be explained by fusion artifacts.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including one or two concrete quantitative results (e.g., faithfulness improvement percentages) rather than only qualitative claims of superiority.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on the CA-LIG framework. We address each of the major comments below and outline the revisions we will make to improve the paper.

Point-by-point responses
  1. Referee: [Methods (CA-LIG Framework)] Methods section (CA-LIG Framework description): The fusion of layer-wise Integrated Gradients with class-specific attention gradients is presented as producing unbiased, context-sensitive maps that trace hierarchical relevance, but no explicit normalization, scaling, or sign-consistency procedure between the IG and attention components is described. Attention gradients are typically sparse and uncalibrated to output sensitivity; without per-component normalization this risks scale or sign artifacts that could dominate or cancel IG contributions, directly undermining the central claim that the method captures inter-token dependencies without systematic bias.

    Authors: We agree with the referee that the original manuscript did not provide sufficient detail on the normalization and scaling procedures used in fusing the layer-wise Integrated Gradients with the class-specific attention gradients. To address this, we will revise the Methods section to include an explicit description of the fusion process. This will specify that both the IG attributions and attention gradients are independently L2-normalized and then scaled by a factor derived from their respective standard deviations to ensure comparable contributions. Sign consistency is preserved by using the signed gradients from the target class. These steps prevent any single component from dominating and ensure the resulting maps accurately reflect inter-token dependencies without systematic bias. We believe this clarification will strengthen the presentation of the CA-LIG framework.

    revision: yes

  2. Referee: [Evaluation] Evaluation section: The manuscript claims consistent improvements in faithfulness and sensitivity across tasks and architectures, yet the provided details lack specific quantitative metrics (e.g., faithfulness scores, AUC for sensitivity), explicit baseline comparisons (standard IG, attention rollout, or Grad-CAM), and statistical tests. This makes it difficult to assess whether the reported superiority is robust or could be explained by fusion artifacts.

    Authors: The referee correctly notes that while the manuscript reports superior performance, the evaluation section would benefit from more granular quantitative details. The paper does compare against standard IG, attention rollout, and Grad-CAM across the described tasks, using faithfulness metrics such as deletion AUC and sensitivity to contextual changes. However, to make these results more transparent and to rule out potential fusion artifacts, we will add explicit tables with numerical scores for each metric and baseline, along with statistical tests (e.g., Wilcoxon signed-rank tests) to confirm the significance of the improvements. This revision will allow for a more rigorous assessment of the claims.

    revision: yes
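The scale and sign concern raised in the first comment, and the per-component L2 normalization proposed in the response, can be illustrated with a small sketch. The additive fusion rule, the weight `lam`, and the example vectors are hypothetical; this page does not give the paper's exact fusion formula, and the standard-deviation scaling step is omitted here.

```python
import math

def l2_normalize(v, eps=1e-12):
    # Rescale a vector to (approximately) unit Euclidean norm.
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]

def fuse(ig, attn_grad, lam=1.0, normalize=True):
    # Hypothetical additive fusion of two per-token relevance vectors.
    if normalize:
        ig, attn_grad = l2_normalize(ig), l2_normalize(attn_grad)
    return [a + lam * g for a, g in zip(ig, attn_grad)]

ig = [0.30, -0.40, 0.00]          # small-scale signed IG scores (made up)
attn_grad = [0.0, 300.0, 400.0]   # attention gradients on a larger scale (made up)

raw = fuse(ig, attn_grad, normalize=False)
balanced = fuse(ig, attn_grad)
print(raw)
print(balanced)
```

Without normalization the second token's negative IG evidence is swamped and its fused score turns positive; after normalizing each component the fused score for that token stays negative, which is the kind of artifact the referee asks the authors to rule out.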

Circularity Check

0 steps flagged

CA-LIG derivation is self-contained with no reduction to inputs by construction

full rationale

The paper proposes CA-LIG as an explicit combination of two pre-existing techniques: layer-wise Integrated Gradients computed per Transformer block and their fusion with class-specific attention gradients. No equations, fitted parameters, or self-citations are presented that would make the output attribution maps equivalent to the inputs by definition. The central claim of improved faithfulness rests on empirical evaluation across tasks rather than any self-referential derivation step. The method is therefore independent of its own outputs and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions of Integrated Gradients (path integration from a baseline to the input) and of attention mechanisms. The abstract introduces no free parameters or invented entities; the single domain assumption it relies on is listed below.

axioms (1)
  • domain assumption: Integrated Gradients attributions can be meaningfully computed within each Transformer block and fused with attention gradients.
    Invoked in the definition of the CA-LIG Framework in the abstract.
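
For reference, the generic Integrated Gradients definition that this assumption builds on (Sundararajan et al. [16]) can be written as below. Here F denotes the model output for the target class, x the input, x' the baseline, and m the number of Riemann steps; these symbols are notation chosen for this note, and how CA-LIG restricts the path to individual blocks is not specified on this page.

```latex
\begin{align}
\mathrm{IG}_i(x) &= (x_i - x'_i) \int_0^1
  \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha \\
&\approx (x_i - x'_i)\, \frac{1}{m} \sum_{k=1}^{m}
  \frac{\partial F\big(x' + \tfrac{k}{m}(x - x')\big)}{\partial x_i} \\
\sum_i \mathrm{IG}_i(x) &= F(x) - F(x') \quad \text{(completeness, exact for the integral form)}
\end{align}
```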




Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 8 internal anchors

  [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018)

  [2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018)

  [3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning research 21 (140) (2020) 1–67

  [4] M. A. Mersha, J. Kalita, et al., Semantic-driven topic modeling using transformer-based embeddings and clustering algorithms, Procedia Computer Science 244 (2024) 121–132

  [5] S. Khapre, M. A. Mersha, H. Shakil, J. Baruah, J. Kalita, Toxicity in online platforms and ai systems: A survey of needs, challenges, mitigations, and future directions, Expert Systems with Applications (2025) 129832

  [6] A. L. Tonja, M. Mersha, A. Kalita, O. Kolesnikova, J. Kalita, First attempt at building parallel corpora for machine translation of northeast india’s very low-resource languages, in: Proceedings of the 20th International Conference on Natural Language Processing (ICON), 2023, pp. 534–539

  [7] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017)

  [8] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in bertology: What we know about how bert works, Transactions of the association for computational linguistics 8 (2020) 842–866

  [9] M. A. Mersha, J. Kalita, Semantic-driven topic modeling for analyzing creativity in virtual brainstorming, arXiv preprint arXiv:2509.16835 (2025)

  [10] S. Liu, F. Le, S. Chakraborty, T. Abdelzaher, On exploring attention-based explanation for transformer models in text classification, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE, 2021, pp. 1193–1203

  [11] C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, M. Wattenberg, Attentionviz: A global view of transformer attention, IEEE Transactions on Visualization and Computer Graphics (2023)

  [12] S. Jain, B. C. Wallace, Attention is not explanation, arXiv preprint arXiv:1902.10186 (2019)

  [13] S. Serrano, N. A. Smith, Is attention interpretable?, arXiv preprint arXiv:1906.03731 (2019)

  [14] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, arXiv preprint arXiv:2005.00928 (2020)

  [15] A. K. AlShami, R. Rabinowitz, K. Lam, Y. Shleibik, M. Mersha, T. Boult, J. Kalita, Smart-vision: survey of modern action recognition techniques in vision, Multimedia tools and applications 84 (27) (2025) 32705–32776

  [16] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: ICML, 2017

  [17] A. Kapishnikov, S. Venugopalan, B. Avci, B. Wedin, M. Terry, T. Bolukbasi, Guided integrated gradients: An adaptive path method for removing noise, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5050–5058

  [18] L. Arras, G. Montavon, K.-R. Müller, W. Samek, Explaining recurrent neural network predictions in sentiment analysis, arXiv preprint arXiv:1706.07206 (2017)

  [19] T. Lan, J. Xu, X. He, J.-N. Hwang, L. Li, Attention consistency for llms explanation, arXiv preprint arXiv:2509.17178 (2025)

  [20] M. A. Mersha, G. Y. Bade, J. Kalita, O. Kolesnikova, A. Gelbukh, et al., Ethio-fake: Cutting-edge approaches to combat fake news in under-resourced languages using explainable ai, Procedia Computer Science 244 (2024) 133–142

  [21] M. A. Mersha, M. G. Yigezu, A. L. Tonja, H. Shakil, S. Iskandar, O. Kolesnikova, J. Kalita, Explainable ai: Xai-guided context-aware data augmentation, Expert Systems with Applications (2025) 128364

  [22] M. A. Mersha, M. G. Yigezu, H. Shakil, A. K. AlShami, S. Byun, J. Kalita, A unified framework with novel metrics for evaluating the effectiveness of xai techniques in llms, arXiv preprint arXiv:2503.05050 (2025)

  [23] M. Mersha, M. Bitewa, T. Abay, J. Kalita, Explainability in neural networks for natural language processing tasks, arXiv preprint arXiv:2412.18036 (2024)

  [24] S. Lundberg, A unified approach to interpreting model predictions, arXiv preprint arXiv:1705.07874 (2017)

  [25] M. T. Ribeiro, S. Singh, C. Guestrin, "why should i trust you?" explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144

  [26] D. Kamen, M. A. Mersha, J. Kalita, Introducing semantic feature dependencies in nlp xai systems with suplime, in: Recent Advances in Natural Language Processing, 2025, p. 47

  [27] M. Zeiler, Visualizing and understanding convolutional networks, in: European conference on computer vision/arXiv, Vol. 1311, 2014

  [28] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929

  [29] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (7) (2015) e0130140

  [30] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al., Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav), in: International conference on machine learning, PMLR, 2018, pp. 2668–2677

  [31] D. Shi, R. Jin, T. Shen, W. Dong, X. Wu, D. Xiong, Ircan: Mitigating knowledge conflicts in llm generation via identifying and reweighting context-aware neurons, Advances in Neural Information Processing Systems 37 (2024) 4997–5024

  [32] J. D. Janizek, P. Sturmfels, S.-I. Lee, Explaining explanations: Axiomatic feature interactions for deep networks, Journal of Machine Learning Research 22 (104) (2021) 1–54

  [33] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences, in: International conference on machine learning, PMLR, 2017, pp. 3145–3153

  [34] S. Srinivas, F. Fleuret, Full-gradient representation for neural network visualization, Advances in neural information processing systems 32 (2019)

  [35] H. Zhu, F. Wei, B. Qin, T. Liu, Hierarchical attention flow for multiple-choice reading comprehension, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018

  [36] J. Vig, A multiscale visualization of attention in the transformer model, arXiv preprint arXiv:1906.05714 (2019)

  [37] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: visual explanations from deep networks via gradient-based localization, International journal of computer vision 128 (2020) 336–359

  [38] H. Chefer, S. Gur, L. Wolf, Transformer interpretability beyond attention visualization, in: CVPR, 2021

  [39] Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, D. Zhu, Attcat: Explaining transformers via attentive class activation tokens, Advances in neural information processing systems 35 (2022) 5052–5064

  [40] T. Yuan, X. Li, H. Xiong, H. Cao, D. Dou, Explaining information flow inside vision transformers using markov chain, in: eXplainable AI approaches for debugging and diagnosis., 2021

  [41] R. Achtibat, S. M. V. Hatefi, M. Dreyer, A. Jain, T. Wiegand, S. Lapuschkin, W. Samek, Attnlrp: attention-aware layer-wise relevance propagation for transformers, arXiv preprint arXiv:2402.05602 (2024)

  [42] M. Mersha, K. Lam, J. Wood, A. AlShami, J. Kalita, Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction, Neurocomputing (2024) 128111

  [43] M. Fantozzi, et al., Explainability in deep learning: Challenges for transformers, Frontiers in Artificial Intelligence (2024)

  [44] Z. Chen, Y. Xie, Y. Wu, Y. Lin, S. Tomiya, J. Lin, An interpretable and transferrable vision transformer model for rapid materials spectra classification, Digital Discovery 3 (2) (2024) 369–380

  [45] D. Smilkov, et al., Smoothgrad: removing noise by adding noise, arXiv preprint arXiv:1706.03825 (2017)

  [46] S. Jain, et al., Inseq: A toolkit for sequence-level interpretability of nlp models, https://github.com/penwang/inseq (2023)

  [47] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-Jussà, A primer on the inner workings of transformer-based language models, arXiv preprint arXiv:2405.00208 (2024)

  [48] B. Azarkhalili, M. W. Libbrecht, Generalized attention flow: Feature attribution for transformer models via maximum flow, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 19954–19974

  [49] S. Han, J. Lee, S. Lee, Contrast-cat: Contrasting activations for enhanced interpretability in transformer-based text classifiers, arXiv preprint arXiv:2507.21186 (2025)

  [50] S. Wiegreffe, Y. Pinter, Attention is not not explanation, arXiv preprint arXiv:1908.04626 (2019)

  [51] A. Ali, A. Kumar, Xai methods for transformers via conservative propagation, in: ICLR, 2022

  [52] E. M. Hou, G. D. Castanon, Decoding layer saliency in language transformers, in: International Conference on Machine Learning, PMLR, 2023, pp. 13285–13308

  [53] A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 142–150

  [54] A. A. Ayele, S. M. Yimam, T. D. Belay, T. Asfaw, C. Biemann, Exploring amharic hate speech data collection and classification approaches, in: Proceedings of the 14th international conference on recent advances in natural language processing, 2023, pp. 49–59

  [55] K. Lang, Newsweeder: Learning to filter netnews, in: Machine learning proceedings 1995, Elsevier, 1995, pp. 331–339

  [56] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009)

  [57] J. Elson, J. R. Douceur, J. Howell, J. Saul, Asirra: a captcha that exploits interest-aligned manual image categorization, CCS 7 (366-374) (2007) 15

  [58] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

  [59] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019)

  [60] B. F. Dossou, A. L. Tonja, O. Yousuf, S. Osei, A. Oppong, I. Shode, O. O. Awoyomi, C. Emezue, Afrolm: A self-active learning-based multilingual pretrained language model for 23 african languages, in: Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), 2022, pp. 52–64

  [61] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009

  [62] N. Hollenstein, L. Beinborn, Relative importance in sentence processing, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021, pp. 141–150

  [63] E. Sood, S. Tannert, D. Frassinelli, A. Bulling, N. T. Vu, Interpreting attention models with human visual attention in machine reading comprehension, arXiv preprint arXiv:2010.06396 (2020)

  [64] J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, Eraser: A benchmark to evaluate rationalized nlp models, arXiv preprint arXiv:1911.03429 (2019)

  [65] M. A. Mersha, M. G. Yigezu, J. Kalita, Evaluating the effectiveness of xai techniques for encoder-based language models, Knowledge-Based Systems 310 (2025) 113042

  [66] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626

  [67] O. Zaidan, J. Eisner, C. Piatko, Using “annotator rationales” to improve machine learning for text categorization, in: Human language technologies 2007: The conference of the North American chapter of the association for computational linguistics; proceedings of the main conference, 2007, pp. 260–267

  [68] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4593–4601

  [69] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4129–4138

  [70] Y. Goldberg, Assessing BERT’s syntactic abilities, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 3623–3632

  [71] T. Aoyama, N. Schneider, Probe-less probing of BERT’s layer-wise linguistic knowledge with masked word prediction, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, 2022, pp. 195–201

  [72] J. Ferrando, Measuring the mixing of contextual information in the transformer, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2022

  [73] N. F. Liu, M. Gardner, Y. Belinkov, M. Peters, N. A. Smith, Linguistic knowledge and transferability of contextual representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Lingu...

  74. [74]

    Clark, U

    K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? an analysis of BERT’s attention, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, As- sociation for Computational Linguistics, 2019, pp. 276– 286

  75. [75]

    Nauta, J

    M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y . Schmitt, J. Schlötterer, M. van Keulen, C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai, ACM Computing Surveys 55 (13s) (2023) 1–42. 18

  76. [76]

    Liu, Cunliang kong, ying liu, and maosong sun

    Z. Liu, Cunliang kong, ying liu, and maosong sun. 2024. fantastic semantics and where to find them: Investigating which layers of generative llms reflect lexical semantics, Findings of the Association for Computational Linguis- tics: ACL (2024) 14551–14558

  77. [77]

    C. Sun, X. Qiu, Y . Xu, X. Huang, Fine-tune BERT for extractive summarization, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Asso- ciation for Computational Linguistics, 2019, pp. 3289– 3299

  78. [78]

    K. Ethayarajh, How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics...

  79. [79]

    Kovaleva, A

    O. Kovaleva, A. Romanov, A. Rogers, A. Rumshisky, Re- vealing the dark secrets of BERT, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 4365–4374

  80. [80]

    Rogers, O

    A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, in: Proceedings of the 58th Annual Meeting of the Associa- tion for Computational Linguistics, 2020, pp. 1–17

Showing first 80 references.