Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Jugal Kalita; Melkamu Abay Mersha

arXiv: 2602.16608 · v2 · pith:NZVMIOVLnew · submitted 2026-02-18 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

Pith reviewed 2026-05-21 12:43 UTC · model grok-4.3

keywords: Explainable AI · Transformer models · Integrated Gradients · Attention gradients · Layer-wise attribution · Context-aware explanations · Interpretability

The pith

A new framework fuses layer-wise Integrated Gradients with class-specific attention gradients to produce more faithful, context-sensitive explanations for Transformer predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework to explain how Transformer models reach decisions. Existing methods either stay at the final layer or treat tokens and attention separately, missing how relevance moves through layers and depends on surrounding tokens. CA-LIG computes Integrated Gradients inside each block and combines them with attention gradients, yielding signed maps that show both supporting and opposing evidence for a prediction. Evaluations on sentiment analysis with BERT, hate-speech detection with XLM-R and AfroLM, and image classification with a vision Transformer show stronger sensitivity to context and clearer visualizations than prior techniques. If the fusion works as described, users gain a unified way to trace decision-making across the entire model hierarchy.

Core claim

The Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the layers.
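
As a point of reference for the Integrated Gradients component named in the claim, the following is a minimal sketch of plain Integrated Gradients on a toy model. It is not the paper's layer-wise implementation: the sigmoid model, the weights `w`, the zero baseline, and the midpoint Riemann sum are all assumptions chosen for illustration. It shows how signed attributions arise and that they approximately sum to the difference between the output at the input and at the baseline.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w):
    """Toy differentiable model (an assumption): sigmoid of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def model_grad(x, w):
    """Analytic gradient of model() with respect to each input feature."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight-line path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    # Average gradient along the path, scaled by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

w = [2.0, -1.5, 0.5]          # hypothetical weights
x = [1.0, 1.0, 1.0]           # hypothetical input
baseline = [0.0, 0.0, 0.0]    # hypothetical all-zeros baseline

attributions = integrated_gradients(x, baseline, w)
gap = model(x, w) - model(baseline, w)
print(attributions, sum(attributions), gap)
```

The feature with a negative weight receives a negative attribution, which is the kind of opposing evidence a signed map can display.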

What carries the argument

The CA-LIG Framework, which integrates layer-wise Integrated Gradients computed inside each Transformer block with class-specific attention gradients to generate context-aware attribution maps.

If this is right

Explanations become traceable across every layer rather than only the output layer.
Attributions distinguish tokens that support a class from those that oppose it in the same map.
The same method applies without modification to BERT, XLM-R, AfroLM, and vision Transformers.
Visualizations highlight inter-token dependencies that single-layer methods overlook.
Performance holds across sentiment analysis, document classification, hate speech detection, and image classification, including a low-resource language setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on generative language models to check whether layer-wise fusion still isolates relevant context in long sequences.
If the maps prove stable under small input changes, they might serve as a diagnostic for detecting when a model relies on spurious correlations.
Extending the fusion to include gradient information from feed-forward sublayers might further refine the attribution of structural components.
Practitioners could use the resulting maps to prioritize which training examples to inspect when auditing model fairness.

Load-bearing premise

Combining layer-wise Integrated Gradients with attention gradients accurately reflects how relevance actually flows through the model without adding bias or artifacts to the maps.

What would settle it

The claim would be challenged by a direct comparison on a held-out test set in which CA-LIG attributions show lower correlation with human-annotated important tokens, or weaker performance on insertion-deletion perturbation tests, than standard Integrated Gradients or attention rollout.
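
A deletion-style perturbation test of the kind mentioned above can be sketched as follows. This is an illustrative toy, not the paper's evaluation protocol: the linear scoring function, the weights, the replacement value of zero, and the two attribution vectors are made up. Features are removed from most to least relevant and the area under the resulting score curve is computed; a lower deletion AUC suggests the ranking found the features the model relies on.

```python
def linear_score(x, w):
    """Toy stand-in (an assumption) for a classifier's target-class score."""
    return sum(wi * xi for wi, xi in zip(w, x))

def deletion_curve(x, w, attributions, baseline_value=0.0):
    """Scores after removing features one at a time, most relevant first."""
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    current = list(x)
    scores = [linear_score(current, w)]
    for i in order:
        current[i] = baseline_value
        scores.append(linear_score(current, w))
    return scores

def auc(scores):
    """Trapezoidal area under the curve with the x-axis scaled to [0, 1]."""
    n = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2.0 for k in range(n)) / n

w = [3.0, 1.0, 0.5, -2.0]                      # hypothetical weights
x = [1.0, 1.0, 1.0, 1.0]                       # hypothetical input
faithful = [wi * xi for wi, xi in zip(w, x)]   # exact contributions for a linear model
reversed_rank = [-a for a in faithful]         # deliberately wrong ordering

auc_faithful = auc(deletion_curve(x, w, faithful))
auc_reversed = auc(deletion_curve(x, w, reversed_rank))
print(auc_faithful, auc_reversed)
```

Comparing two attribution methods on the same inputs in this way gives one number per example, which can then feed a paired statistical test.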

Figures

Figures reproduced from arXiv: 2602.16608 by Jugal Kalita, Melkamu Abay Mersha.

**Figure 1.** Proposed architecture of the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework. For each Transformer block $b$, we perform an element-wise combination of the attention gradients $\nabla A^{(b)} \in \mathbb{R}^{h \times s \times s}$ with a normalized form of token-level relevance $\mathrm{Norm}(R^{(l)}) \in \mathbb{R}^{s}$, where $R^{(l)}$ is the relevance vector from layer $l$, which is aligned to the attention map’s sequence dimension. We apply the Symmetric M…

**Figure 2.** CA-LIG token-level attributions for a document labeled with the Christian class from the 20 Newsgroups dataset using BERT-large. Brighter green tokens provide stronger positive evidence, lighter green indicates weaker support, red shows negative influence, and white denotes neutral relevance.

**Figure 3.** CA-LIG token-level attributions for a document labeled with the atheist class from the 20 Newsgroups dataset using BERT-base. Brighter green tokens provide stronger positive evidence, lighter green indicates weaker support, red shows negative influence, and white denotes neutral relevance.

**Figure 4.** CA-LIG token-level attributions for a negative IMDB review using BERT-large. Brighter red indicates stronger negative evidence, green indicates positive relevance, and white denotes neutral tokens. … layer-wise attribution case analyses that highlight the effectiveness of our proposed framework across various Transformer models and tasks. All experiments are conducted with λ = 1, which provides a balanced f…

**Figure 5.** CA-LIG token-level attributions for an Amharic hate speech sample using the XLM-R model. Brighter red indicates stronger negative evidence, green indicates positive relevance, and white denotes neutral tokens.

**Figure 8.** [caption not available]

**Figure 9.** Qualitative comparison of explanations generated by baseline XAI methods and CA-LIG using an MAE model. Warmer colors denote regions with higher positive relevance, while cooler colors indicate lower relevance. Images are taken from the ASIRRA dataset [57]. (a) Original Input (b) CA-LIG Explanation (c) Positive Attribution (helps prediction) (d) Negative Attribution (hinders prediction)

**Figure 10.** Example of an explanation generated using CA-LIG for a prediction made by the MAE model. (a) Original input image, (b) CA-LIG explanation heatmap, (c) positively attributed regions, and (d) negatively attributed regions. Warmer colors indicate stronger relevance. … hate speech sample using XLM-R and AfroLM models, respectively. CA-LIG assigns strong negative relevance to explicitly abusive tokens (…

**Figure 11.** Qualitative comparison of explanations produced by baseline XAI methods and CA-LIG using a BERT-base model. Brighter green indicates stronger positive relevance, red indicates negative relevance, and white represents neutral tokens. … all methods improve as more tokens are included, our CA-LIG approach consistently achieves higher token-F1 than the baselines. In the vision task, we assess the faithfulnes…
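
The token-F1 score mentioned in the Figure 11 text can be illustrated with a small sketch. The exact protocol the paper uses is not shown on this page, so the details here are assumptions: the top-k tokens are selected by attribution magnitude, the gold rationale is a set of token indices, and the example sentence, scores, and annotation are made up.

```python
def top_k_tokens(attributions, k):
    """Indices of the k tokens with the largest attribution magnitude (an assumption)."""
    ranked = sorted(range(len(attributions)), key=lambda i: abs(attributions[i]), reverse=True)
    return set(ranked[:k])

def token_f1(predicted, gold):
    """Harmonic mean of token-level precision and recall."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

tokens = ["the", "plot", "was", "painfully", "dull"]
attributions = [0.01, 0.10, -0.02, -0.60, -0.75]   # made-up attribution scores
gold_rationale = {3, 4}                              # made-up human annotation

# Token-F1 as a function of how many top-ranked tokens are kept.
f1_by_k = {k: token_f1(top_k_tokens(attributions, k), gold_rationale) for k in (1, 2, 3)}
print(f1_by_k)
```

Sweeping `k` in this way produces one curve per attribution method, which is one plausible reading of the comparison "as more tokens are included".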

Original abstract

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and Transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework to explain Transformer models. It computes layer-wise Integrated Gradients within each Transformer block and fuses these with class-specific attention gradients to generate signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing hierarchical relevance flow. Evaluations are reported on sentiment analysis and document classification with BERT, hate speech detection with XLM-R and AfroLM, and image classification with a Masked Autoencoder ViT, with claims of superior faithfulness, contextual sensitivity, and visualization clarity over existing methods.

Significance. If the results hold after verification, this offers a unified hierarchical attribution approach that integrates local token-level IG with global attention patterns across layers, addressing gaps in final-layer-only explainability methods for Transformers in NLP and vision tasks.

major comments (2)

[Methods (CA-LIG Framework)] Methods section (CA-LIG Framework description): The fusion of layer-wise Integrated Gradients with class-specific attention gradients is presented as producing unbiased, context-sensitive maps that trace hierarchical relevance, but no explicit normalization, scaling, or sign-consistency procedure between the IG and attention components is described. Attention gradients are typically sparse and uncalibrated to output sensitivity; without per-component normalization this risks scale or sign artifacts that could dominate or cancel IG contributions, directly undermining the central claim that the method captures inter-token dependencies without systematic bias.
[Evaluation] Evaluation section: The manuscript claims consistent improvements in faithfulness and sensitivity across tasks and architectures, yet the provided details lack specific quantitative metrics (e.g., faithfulness scores, AUC for sensitivity), explicit baseline comparisons (standard IG, attention rollout, or Grad-CAM), and statistical tests. This makes it difficult to assess whether the reported superiority is robust or could be explained by fusion artifacts.
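
The scale concern in the first major comment can be made concrete with a toy example. Everything below is hypothetical and is not the paper's fusion rule: the additive combination weighted by `lam`, the pooling over heads and query positions, the tensor shapes (one head, three tokens), and the numbers are assumptions. The sketch only shows that, without per-component normalization, a large-scale relevance vector can swamp small-scale attention gradients, while independent L2 normalization brings the two into a comparable range.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def normalize_vector(vec):
    """Scale a vector to unit L2 norm (returned unchanged if all zeros)."""
    n = l2_norm(vec)
    return [v / n for v in vec] if n > 0 else list(vec)

def normalize_tensor(tensor):
    """Scale an h x s x s nested list to unit L2 norm over all entries."""
    n = l2_norm([v for head in tensor for row in head for v in row])
    if n == 0:
        return tensor
    return [[[v / n for v in row] for row in head] for head in tensor]

def context_term(relevance, attn_grad):
    """Broadcast relevance over the key axis, then average over heads and queries."""
    h, s = len(attn_grad), len(relevance)
    return [
        sum(attn_grad[a][i][j] * relevance[j] for a in range(h) for i in range(s)) / (h * s)
        for j in range(s)
    ]

def fuse(relevance, attn_grad, lam=1.0, normalize=True):
    """Hypothetical additive fusion: relevance + lam * context term."""
    if normalize:
        relevance = normalize_vector(relevance)
        attn_grad = normalize_tensor(attn_grad)
    ctx = context_term(relevance, attn_grad)
    return [r + lam * c for r, c in zip(relevance, ctx)]

relevance = [40.0, -30.0, 0.0]            # made-up large-scale token relevance
attn_grad = [[[0.001, 0.002, 0.0],        # made-up small-scale attention gradients
              [0.000, 0.004, 0.0],
              [0.003, 0.000, 0.0]]]

raw = fuse(relevance, attn_grad, normalize=False)
balanced = fuse(relevance, attn_grad, normalize=True)

# Relative change that the attention term makes to the first token's score.
raw_shift = abs(raw[0] - relevance[0]) / abs(relevance[0])
unit_first = normalize_vector(relevance)[0]
balanced_shift = abs(balanced[0] - unit_first) / unit_first
print(raw_shift, balanced_shift)
```

In the unnormalized case the attention term moves the first token's score by a fraction of a percent, so the fused map is effectively the relevance vector alone.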

minor comments (1)

[Abstract] The abstract would be strengthened by including one or two concrete quantitative results (e.g., faithfulness improvement percentages) rather than only qualitative claims of superiority.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on the CA-LIG framework. We address each of the major comments below and outline the revisions we will make to improve the paper.

Referee: [Methods (CA-LIG Framework)] Methods section (CA-LIG Framework description): The fusion of layer-wise Integrated Gradients with class-specific attention gradients is presented as producing unbiased, context-sensitive maps that trace hierarchical relevance, but no explicit normalization, scaling, or sign-consistency procedure between the IG and attention components is described. Attention gradients are typically sparse and uncalibrated to output sensitivity; without per-component normalization this risks scale or sign artifacts that could dominate or cancel IG contributions, directly undermining the central claim that the method captures inter-token dependencies without systematic bias.

Authors: We agree with the referee that the original manuscript did not provide sufficient detail on the normalization and scaling procedures used in fusing the layer-wise Integrated Gradients with the class-specific attention gradients. To address this, we will revise the Methods section to include an explicit description of the fusion process. This will specify that both the IG attributions and attention gradients are independently L2-normalized and then scaled by a factor derived from their respective standard deviations to ensure comparable contributions. Sign consistency is preserved by using the signed gradients from the target class. These steps prevent any single component from dominating and ensure the resulting maps accurately reflect inter-token dependencies without systematic bias. We believe this clarification will strengthen the presentation of the CA-LIG framework.

revision: yes

Referee: [Evaluation] Evaluation section: The manuscript claims consistent improvements in faithfulness and sensitivity across tasks and architectures, yet the provided details lack specific quantitative metrics (e.g., faithfulness scores, AUC for sensitivity), explicit baseline comparisons (standard IG, attention rollout, or Grad-CAM), and statistical tests. This makes it difficult to assess whether the reported superiority is robust or could be explained by fusion artifacts.

Authors: The referee correctly notes that while the manuscript reports superior performance, the evaluation section would benefit from more granular quantitative details. The paper does compare against standard IG, attention rollout, and Grad-CAM across the described tasks, using faithfulness metrics such as deletion AUC and sensitivity to contextual changes. However, to make these results more transparent and to rule out potential fusion artifacts, we will add explicit tables with numerical scores for each metric and baseline, along with statistical tests (e.g., Wilcoxon signed-rank tests) to confirm the significance of the improvements. This revision will allow for a more rigorous assessment of the claims.

revision: yes
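
For readers unfamiliar with the Wilcoxon signed-rank test named in the response, here is a minimal sketch of its test statistic for paired per-example scores. The scores are made up, zero differences are dropped, tied magnitudes receive average ranks, and no p-value is computed; it is meant only to show what the test compares, not how the authors will run it.

```python
def signed_rank_statistic(a, b):
    """Return (W_plus, W_minus) for paired samples; zero differences are dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        # Find the run of tied absolute differences starting at pos.
        end = pos
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0   # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Made-up per-example faithfulness scores (in percent) for two methods.
method_a = [62, 71, 55, 80, 66, 59]
method_b = [58, 65, 57, 70, 61, 60]

w_plus, w_minus = signed_rank_statistic(method_a, method_b)
print(w_plus, w_minus)
```

In practice a library routine such as `scipy.stats.wilcoxon` would be the usual way to obtain the accompanying p-value.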

Circularity Check

0 steps flagged

CA-LIG derivation is self-contained with no reduction to inputs by construction

full rationale

The paper proposes CA-LIG as an explicit combination of two pre-existing techniques: layer-wise Integrated Gradients computed per Transformer block and their fusion with class-specific attention gradients. No equations, fitted parameters, or self-citations are presented that would make the output attribution maps equivalent to the inputs by definition. The central claim of improved faithfulness rests on empirical evaluation across tasks rather than any self-referential derivation step. The method is therefore independent of its own outputs and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions of Integrated Gradients (path integration from baseline to input) and attention mechanisms; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Integrated Gradients attributions can be meaningfully computed within each Transformer block and fused with attention gradients.
Invoked in the definition of the CA-LIG Framework in the abstract.
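
For reference, the standard Integrated Gradients definition from Sundararajan et al. [16], on which this assumption rests, can be written as below. Here $F$ is the model output for the target class, $x$ the input, and $x'$ the baseline. How the paper adapts the path to the hidden states of an individual block is not specified on this page, so the formula is given only in its original input-level form.

```latex
\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\,d\alpha,
\qquad
\sum_i \mathrm{IG}_i(x) \;=\; F(x) - F(x').
```

The second identity is the completeness property: attributions sum to the difference between the output at the input and at the baseline.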

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 8 internal anchors

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018)

[2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018)

[3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning research 21 (140) (2020) 1–67

[4] M. A. Mersha, J. Kalita, et al., Semantic-driven topic modeling using transformer-based embeddings and clustering algorithms, Procedia Computer Science 244 (2024) 121–132

[5] S. Khapre, M. A. Mersha, H. Shakil, J. Baruah, J. Kalita, Toxicity in online platforms and ai systems: A survey of needs, challenges, mitigations, and future directions, Expert Systems with Applications (2025) 129832

[6] A. L. Tonja, M. Mersha, A. Kalita, O. Kolesnikova, J. Kalita, First attempt at building parallel corpora for machine translation of northeast india’s very low-resource languages, in: Proceedings of the 20th International Conference on Natural Language Processing (ICON), 2023, pp. 534–539

[7] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017)

[8] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in bertology: What we know about how bert works, Transactions of the association for computational linguistics 8 (2020) 842–866

[9] M. A. Mersha, J. Kalita, Semantic-driven topic modeling for analyzing creativity in virtual brainstorming, arXiv preprint arXiv:2509.16835 (2025)

[10] S. Liu, F. Le, S. Chakraborty, T. Abdelzaher, On exploring attention-based explanation for transformer models in text classification, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE, 2021, pp. 1193–1203

[11] C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, M. Wattenberg, Attentionviz: A global view of transformer attention, IEEE Transactions on Visualization and Computer Graphics (2023)

[12] S. Jain, B. C. Wallace, Attention is not explanation, arXiv preprint arXiv:1902.10186 (2019)

[13] S. Serrano, N. A. Smith, Is attention interpretable?, arXiv preprint arXiv:1906.03731 (2019)

[14] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, arXiv preprint arXiv:2005.00928 (2020)

[15] A. K. AlShami, R. Rabinowitz, K. Lam, Y. Shleibik, M. Mersha, T. Boult, J. Kalita, Smart-vision: survey of modern action recognition techniques in vision, Multimedia tools and applications 84 (27) (2025) 32705–32776

[16] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: ICML, 2017

[17] A. Kapishnikov, S. Venugopalan, B. Avci, B. Wedin, M. Terry, T. Bolukbasi, Guided integrated gradients: An adaptive path method for removing noise, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5050–5058

[18] L. Arras, G. Montavon, K.-R. Müller, W. Samek, Explaining recurrent neural network predictions in sentiment analysis, arXiv preprint arXiv:1706.07206 (2017)

[19] T. Lan, J. Xu, X. He, J.-N. Hwang, L. Li, Attention consistency for llms explanation, arXiv preprint arXiv:2509.17178 (2025)

[20] M. A. Mersha, G. Y. Bade, J. Kalita, O. Kolesnikova, A. Gelbukh, et al., Ethio-fake: Cutting-edge approaches to combat fake news in under-resourced languages using explainable ai, Procedia Computer Science 244 (2024) 133–142

[21] M. A. Mersha, M. G. Yigezu, A. L. Tonja, H. Shakil, S. Iskandar, O. Kolesnikova, J. Kalita, Explainable ai: Xai-guided context-aware data augmentation, Expert Systems with Applications (2025) 128364

[22] M. A. Mersha, M. G. Yigezu, H. Shakil, A. K. AlShami, S. Byun, J. Kalita, A unified framework with novel metrics for evaluating the effectiveness of xai techniques in llms, arXiv preprint arXiv:2503.05050 (2025)

[23] M. Mersha, M. Bitewa, T. Abay, J. Kalita, Explainability in neural networks for natural language processing tasks, arXiv preprint arXiv:2412.18036 (2024)

[24] S. Lundberg, A unified approach to interpreting model predictions, arXiv preprint arXiv:1705.07874 (2017)

[25] M. T. Ribeiro, S. Singh, C. Guestrin, "why should i trust you?" explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144

[26] D. Kamen, M. A. Mersha, J. Kalita, Introducing semantic feature dependencies in nlp xai systems with suplime, in: Recent Advances in Natural Language Processing, 2025, p. 47

[27] M. Zeiler, Visualizing and understanding convolutional networks, in: European conference on computer vision/arXiv, Vol. 1311, 2014

[28] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929

[29] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (7) (2015) e0130140

[30] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al., Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav), in: International conference on machine learning, PMLR, 2018, pp. 2668–2677

[31] D. Shi, R. Jin, T. Shen, W. Dong, X. Wu, D. Xiong, Ircan: Mitigating knowledge conflicts in llm generation via identifying and reweighting context-aware neurons, Advances in Neural Information Processing Systems 37 (2024) 4997–5024

[32] J. D. Janizek, P. Sturmfels, S.-I. Lee, Explaining explanations: Axiomatic feature interactions for deep networks, Journal of Machine Learning Research 22 (104) (2021) 1–54

[33] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences, in: International conference on machine learning, PMLR, 2017, pp. 3145–3153

[34] S. Srinivas, F. Fleuret, Full-gradient representation for neural network visualization, Advances in neural information processing systems 32 (2019)

[35] H. Zhu, F. Wei, B. Qin, T. Liu, Hierarchical attention flow for multiple-choice reading comprehension, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018

[36] J. Vig, A multiscale visualization of attention in the transformer model, arXiv preprint arXiv:1906.05714 (2019)

[37] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: visual explanations from deep networks via gradient-based localization, International journal of computer vision 128 (2020) 336–359

[38] H. Chefer, S. Gur, L. Wolf, Transformer interpretability beyond attention visualization, in: CVPR, 2021

[39] Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, D. Zhu, Attcat: Explaining transformers via attentive class activation tokens, Advances in neural information processing systems 35 (2022) 5052–5064

[40] T. Yuan, X. Li, H. Xiong, H. Cao, D. Dou, Explaining information flow inside vision transformers using markov chain, in: eXplainable AI approaches for debugging and diagnosis, 2021

[41] R. Achtibat, S. M. V. Hatefi, M. Dreyer, A. Jain, T. Wiegand, S. Lapuschkin, W. Samek, Attnlrp: attention-aware layer-wise relevance propagation for transformers, arXiv preprint arXiv:2402.05602 (2024)

[42] M. Mersha, K. Lam, J. Wood, A. AlShami, J. Kalita, Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction, Neurocomputing (2024) 128111

[43] M. Fantozzi, et al., Explainability in deep learning: Challenges for transformers, Frontiers in Artificial Intelligence (2024)

[44] Z. Chen, Y. Xie, Y. Wu, Y. Lin, S. Tomiya, J. Lin, An interpretable and transferrable vision transformer model for rapid materials spectra classification, Digital Discovery 3 (2) (2024) 369–380

[45] D. Smilkov, et al., Smoothgrad: removing noise by adding noise, arXiv preprint arXiv:1706.03825 (2017)

[46] S. Jain, et al., Inseq: A toolkit for sequence-level interpretability of nlp models, https://github.com/penwang/inseq (2023)

[47] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-Jussà, A primer on the inner workings of transformer-based language models, arXiv preprint arXiv:2405.00208 (2024)

[48] B. Azarkhalili, M. W. Libbrecht, Generalized attention flow: Feature attribution for transformer models via maximum flow, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 19954–19974

[49] S. Han, J. Lee, S. Lee, Contrast-cat: Contrasting activations for enhanced interpretability in transformer-based text classifiers, arXiv preprint arXiv:2507.21186 (2025)

[50] S. Wiegreffe, Y. Pinter, Attention is not not explanation, arXiv preprint arXiv:1908.04626 (2019)

[51] A. Ali, A. Kumar, Xai methods for transformers via conservative propagation, in: ICLR, 2022

[52] E. M. Hou, G. D. Castanon, Decoding layer saliency in language transformers, in: International Conference on Machine Learning, PMLR, 2023, pp. 13285–13308

[53] A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 142–150

[54] A. A. Ayele, S. M. Yimam, T. D. Belay, T. Asfaw, C. Biemann, Exploring amharic hate speech data collection and classification approaches, in: Proceedings of the 14th international conference on recent advances in natural language processing, 2023, pp. 49–59

[55] K. Lang, Newsweeder: Learning to filter netnews, in: Machine learning proceedings 1995, Elsevier, 1995, pp. 331–339

[56] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009)

[57] J. Elson, J. R. Douceur, J. Howell, J. Saul, Asirra: a captcha that exploits interest-aligned manual image categorization, CCS 7 (366-374) (2007) 15

[58] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

[59] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019)

[60] B. F. Dossou, A. L. Tonja, O. Yousuf, S. Osei, A. Oppong, I. Shode, O. O. Awoyomi, C. Emezue, Afrolm: A self-active learning-based multilingual pretrained language model for 23 african languages, in: Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), 2022, pp. 52–64

[61]

K. He, X. Chen, S. Xie, Y . Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009

work page 2022
[62]

Hollenstein, L

N. Hollenstein, L. Beinborn, Relative importance in sen- tence processing, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguis- tics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021, pp. 141–150

[63] E. Sood, S. Tannert, D. Frassinelli, A. Bulling, N. T. Vu, Interpreting attention models with human visual attention in machine reading comprehension, arXiv preprint arXiv:2010.06396 (2020)

[64] J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, Eraser: A benchmark to evaluate rationalized nlp models, arXiv preprint arXiv:1911.03429 (2019)

[65] M. A. Mersha, M. G. Yigezu, J. Kalita, Evaluating the effectiveness of xai techniques for encoder-based language models, Knowledge-Based Systems 310 (2025) 113042

[66] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626

[67] O. Zaidan, J. Eisner, C. Piatko, Using “annotator rationales” to improve machine learning for text categorization, in: Human language technologies 2007: The conference of the North American chapter of the association for computational linguistics; proceedings of the main conference, 2007, pp. 260–267

[68] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4593–4601

[69] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4129–4138

[70] Y. Goldberg, Assessing BERT’s syntactic abilities, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 3623–3632

[71] T. Aoyama, N. Schneider, Probe-less probing of BERT’s layer-wise linguistic knowledge with masked word prediction, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, 2022, pp. 195–201

[72] J. Ferrando, Measuring the mixing of contextual information in the transformer, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2022

[73] N. F. Liu, M. Gardner, Y. Belinkov, M. Peters, N. A. Smith, Linguistic knowledge and transferability of contextual representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Lingu...

[74] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? an analysis of BERT’s attention, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, 2019, pp. 276–286

[75] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. van Keulen, C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai, ACM Computing Surveys 55 (13s) (2023) 1–42

[76] Z. Liu, C. Kong, Y. Liu, M. Sun, Fantastic semantics and where to find them: Investigating which layers of generative llms reflect lexical semantics, Findings of the Association for Computational Linguistics: ACL (2024) 14551–14558

[77] C. Sun, X. Qiu, Y. Xu, X. Huang, Fine-tune BERT for extractive summarization, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 3289–3299

[78] K. Ethayarajh, How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics...

[79] O. Kovaleva, A. Romanov, A. Rogers, A. Rumshisky, Revealing the dark secrets of BERT, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 4365–4374

[80] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1–17


