Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Janan Arslan; Md Raisul Kibria; Sébastien Lafond

arxiv: 2508.04427 · v1 · submitted 2025-08-06 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-19 00:30 UTC · model grok-4.3


keywords: explainable AI · multimodal learning · attention mechanisms · systematic review · evaluation methodologies · vision-language models · XAI


The pith

Evaluation methods for XAI in multimodal attention models are inconsistent and overlook modality-specific factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This systematic review examines research published from January 2020 to early 2024 on explainability techniques for multimodal attention-based models. It organizes the literature by model architecture, input modalities, explanation algorithms, and evaluation approaches. The analysis finds heavy concentration on vision-language and language-only models that rely on attention mechanisms, yet these techniques rarely capture complete cross-modal interactions. The central observation is that current evaluation methods remain non-systematic, lack consistent metrics, and ignore cognitive or contextual differences tied to each modality. Readers would care because weak evaluation practices make it difficult to verify whether explanations actually support trustworthy decisions in applied multimodal systems.

Core claim

The paper establishes that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Studies predominantly address vision-language models and employ attention-based explanation algorithms, but these approaches fall short of capturing the full range of modality interactions because of architectural heterogeneity across domains. The authors respond with recommendations for rigorous, transparent, and standardized evaluation and reporting practices to advance more interpretable and accountable multimodal AI.

What carries the argument

Multi-dimensional analysis of the literature across model architecture, modalities, explanation algorithms, and evaluation methodologies.

If this is right

Standardized evaluation protocols would increase consistency when comparing XAI techniques across multimodal tasks.
Explanation algorithms would need to be redesigned to capture interactions between modalities more completely.
Evaluations would incorporate modality-specific cognitive and contextual factors as a required component.
Transparent reporting standards would improve accountability in multimodal AI development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Inconsistent evaluations could delay safe adoption of multimodal models in domains that require high trust, such as medical imaging or autonomous navigation.
The observed emphasis on vision-language pairs leaves explainability in other combinations, like audio-text or sensor fusion, relatively unexamined.
Following the recommendations could enable direct benchmarking of different XAI approaches and speed cumulative progress in the area.

Load-bearing premise

The literature search from January 2020 to early 2024 and the chosen analysis dimensions capture a representative and unbiased sample of research in the field.

What would settle it

A new survey that identifies multiple studies sharing the same consistent, robust evaluation protocol that explicitly incorporates modality-specific cognitive and contextual factors would challenge the claim.

Figures

Figures reproduced from arXiv: 2508.04427 by Janan Arslan, Md Raisul Kibria, Sébastien Lafond.

**Figure 2.** Publications per year were grouped into themes defined by application domains and tasks. Although each study was assigned to a single theme to streamline the analysis, it is important to note that both the themes and assigned studies are not strictly mutually exclusive and may span multiple disciplines (e.g., Natural Language Processing (NLP) and Translation vs. Question Answering and Summarization). The th…

**Figure 3.** Key Bibliometric Analytics of the Publications

**Figure 4.** Representation of modalities. The least amount of work is targeted towards code modeling. These findings are aligned with other surveys on XAI (e.g., [40]). An important decision regarding the eligibility criteria in our study is the inclusion of "multichannel" modeling approaches. These criteria cover models that make decisions based on multiple inputs generated by processing the same source input. For ins…

**Figure 5.** Distribution of multimodal/generative and multichannel modeling approaches

**Figure 6.** Primary training task objectives

**Figure 7.** Block diagram illustrating various fusion architecture types: Early fusion (a, b); Hierarchical…

**Figure 8.** Classification and distribution of explanation algorithms

**Figure 9.** Decoder Layer Contribution Matrix in ALTI+ Method [87]

**Figure 10.** Initially, the unimodal relevance scores are initialized as identity matrices, while bimodal (cross-modal) relevance maps are initialized with zeros. The attention map update…

**Figure 10.** Visual explanation using the multimodal attention-composite method by Chefer et al. for ob…

**Figure 11.** SHAP values to explain pathology images and corresponding questions in VQA tasks [74].

**Figure 12.** Categories of Explanation Evaluation metrics and distribution

**Figure 13.** Flow-graph for layer 10 attention from GPT-2 small in VISIT for an IOI task [55].
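
The first Figure 10 caption above describes how the relevance maps start out: identity matrices within each modality and zeros across modalities. A minimal sketch of that initialization step only, assuming two modalities (named `text` and `image` here purely for illustration) and plain nested lists. The function name `init_relevance_maps` and the dictionary keys are hypothetical, and the update rule, which the caption cuts off, is deliberately left out.

```python
def init_relevance_maps(n_text, n_image):
    """Return initial relevance maps for a two-modality model.

    Keys are hypothetical labels: "tt" and "ii" are the unimodal maps,
    "ti" and "it" are the bimodal (cross-modal) maps.
    """
    def identity(n):
        # Each token starts out fully relevant to itself and to nothing else.
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def zeros(rows, cols):
        # No cross-modal relevance has been accumulated yet.
        return [[0.0] * cols for _ in range(rows)]

    return {
        "tt": identity(n_text),
        "ii": identity(n_image),
        "ti": zeros(n_text, n_image),
        "it": zeros(n_image, n_text),
    }


# Example with 3 text tokens and 2 image tokens (sizes chosen arbitrarily).
maps = init_relevance_maps(n_text=3, n_image=2)
```

The cross-modal maps are rectangular because the two modalities generally contribute different numbers of tokens; only the unimodal maps are square.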

Original abstract

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible mulitmodal AI systems, with explainability at their core.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This review flags inconsistent evaluations in multimodal XAI, but the strength of that claim rests on whether its attention-focused sample from January 2020 to early 2024 captures the broader literature.


The main point is that this systematic review finds evaluation practices for explainability in attention-based multimodal models to be mostly non-systematic, inconsistent, and weak on modality-specific or cognitive factors. They back this with a breakdown of papers by architecture, modalities, explanation methods, and evaluation approaches, plus a list of recommendations for better standards going forward.

That synthesis is the useful part here. It lines up with the heavy tilt toward vision-language work and attention techniques that shows up in the abstract, and the recommendations read as concrete enough to be actionable for someone planning a new study. They give credit to the performance gains from these models while pointing out where interpretability lags.

The soft spot is the search and selection process. Limiting the window to January 2020 through early 2024 and focusing on attention-based models risks under-sampling audio, sensor, or non-attention multimodal papers whose evaluation sections might not have been obvious from titles or abstracts. If the Boolean strings or databases missed relevant work, the pattern of weak evaluations could partly be an artifact of what got included rather than a field-wide fact. The abstract states clear findings but does not spell out the exact inclusion criteria or quantitative synthesis steps, so a referee would need to see those details to assess how representative the sample really is.

This is for researchers already working in multimodal XAI who want a quick map of current gaps and some ideas for tightening evaluation. Someone outside the subfield or looking for new methods would get less from it. It deserves peer review because the topic is timely and the critique of evaluation practices could steer future work, even if the methods section needs expansion to address selection concerns.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a systematic literature review of explainability research on multimodal attention-based models, covering publications from January 2020 to early 2024. It analyzes the selected works along the dimensions of model architecture, involved modalities, explanation algorithms, and evaluation methodologies. The central finding is that evaluation methods in this area are largely non-systematic, lacking consistency, robustness, and attention to modality-specific cognitive or contextual factors; the paper concludes with a set of recommendations for more rigorous and standardized practices.

Significance. If the sampled literature accurately reflects the field, the review usefully documents gaps in evaluation rigor for multimodal XAI and supplies concrete recommendations that could help standardize future work. The multi-dimensional framing (architecture, modalities, algorithms, evaluation) is a constructive organizing device for the synthesis.

major comments (1)

[Section 2] Section 2 (Literature Search and Selection Criteria): The description of the search strategy, databases, exact Boolean strings, and inclusion/exclusion criteria is not detailed enough to allow an independent assessment of coverage. This directly affects the load-bearing claim that evaluation methods are 'largely non-systematic' across multimodal XAI, because an under-sampling of non-attention-based or non-vision-language work (e.g., audio-sensor or time-series multimodal models) could artifactually produce the observed pattern.

minor comments (2)

[Abstract] Abstract: 'mulitmodal' is a typographical error and should read 'multimodal'.
Throughout the results sections, some tables summarizing evaluation metrics would benefit from an additional column or footnote clarifying whether the reported metrics are quantitative, qualitative, or human-subject based, to improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our systematic literature review. We address the major comment point by point below, providing the strongest honest defense of the manuscript while agreeing where revisions are warranted to improve transparency and reproducibility.


Referee: [Section 2] Section 2 (Literature Search and Selection Criteria): The description of the search strategy, databases, exact Boolean strings, and inclusion/exclusion criteria is not detailed enough to allow an independent assessment of coverage. This directly affects the load-bearing claim that evaluation methods are 'largely non-systematic' across multimodal XAI, because an under-sampling of non-attention-based or non-vision-language work (e.g., audio-sensor or time-series multimodal models) could artifactually produce the observed pattern.

Authors: We thank the referee for this observation. We agree that greater specificity in Section 2 is needed to support independent verification of our search coverage and to strengthen the foundation for our central claim. In the revised manuscript we will expand this section to include: the complete list of databases (IEEE Xplore, ACM Digital Library, Scopus, Web of Science, arXiv, and Google Scholar), the exact Boolean search strings employed (combinations of terms such as 'multimodal attention' AND ('explainability' OR 'XAI' OR 'interpretability') AND modality-specific keywords), the full inclusion/exclusion criteria with justifications, the number of records at each screening stage, and a PRISMA flow diagram. Regarding potential under-sampling, our protocol was deliberately scoped to attention-based multimodal models with explainability components; while vision-language tasks dominate the retrieved literature, we did include qualifying studies involving audio, sensor, and time-series modalities. We will add an explicit limitations subsection discussing the observed distribution across modalities and its implications for generalizing the evaluation-method findings. These changes will make the sampling transparent and allow readers to assess whether the reported patterns in evaluation rigor are representative of the sampled corpus.

revision: yes
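
The search-string pattern quoted in the authors' response can be assembled mechanically. A minimal sketch, assuming the three-part structure given there (a core phrase, an OR-group of explainability terms, and an OR-group of modality keywords). The helper `build_query` is hypothetical, and the modality keywords below are placeholders, since the response does not list them. Real databases differ in quoting and field syntax, so the output would need adapting for each one.

```python
def build_query(core_phrase, xai_terms, modality_terms):
    """Compose a Boolean query: core AND (xai OR ...) AND (modality OR ...)."""
    def any_of(terms):
        # Quote each term and join the alternatives with OR inside parentheses.
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

    return f'"{core_phrase}" AND {any_of(xai_terms)} AND {any_of(modality_terms)}'


query = build_query(
    core_phrase="multimodal attention",
    xai_terms=["explainability", "XAI", "interpretability"],
    # Placeholder modality keywords, not taken from the paper.
    modality_terms=["vision-language", "audio", "sensor"],
)
print(query)
```

Keeping the term lists as data rather than hand-written strings makes it straightforward to report the exact query alongside the record counts at each screening stage.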

Circularity Check

0 steps flagged

No circularity: systematic literature review without derivations or self-referential predictions


This paper is a systematic literature review that synthesizes published work on explainability in multimodal attention-based models from January 2020 to early 2024. It examines dimensions such as architecture, modalities, explanation algorithms, and evaluation methodologies but contains no equations, fitted parameters, predictions, or derivation chains. The central claim about non-systematic evaluation methods follows directly from the authors' analysis of sampled papers rather than reducing to any self-definition, fitted input, or self-citation load-bearing step. The work is self-contained as an external synthesis with no internal circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The review rests on standard assumptions about literature coverage and categorization without introducing free parameters or new entities.

axioms (1)

domain assumption: The chosen time period and focus on attention-based multimodal models capture the relevant body of explainability research.
This premise defines the scope of the systematic review as described in the abstract.

pith-pipeline@v0.9.0 · 5797 in / 1112 out tokens · 57366 ms · 2026-05-19T00:30:03.785396+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities... evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Table 6: Architecture Variants of Multimodal Attention-based Models... Early Summation, Early Concatenation, Hierarchical Multi-to-One, Single Cross-Attention Branch, Multi-Cross Attention

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

132 extracted references · 132 canonical work pages · 8 internal anchors

[1] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (5) (2018) 1–42.

[2] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai, Information fusion 58 (2020) 82–115.

[3] N. Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317.

[4] G. Yang, Q. Ye, J. Xia, Unbox the black-box for the medical explainable ai via multi-modal and multi-centre data fusion: A mini-review, two showcases and beyond, Information Fusion 77 (2022) 29–52.

[5] L. Nannini, A. Balayn, A. L. Smith, Explainability in ai policies: A critical review of communications, reports, regulations, and standards in the eu, us, and uk, in: Proceedings of the 2023 ACM conference on fairness, accountability, and transparency, 2023, pp. 1198–1212.

[6] J. Chun, C. S. de Witt, K. Elkins, Comparative global ai regulation: Policy perspectives from the eu, china, and the us, arXiv preprint arXiv:2410.21279 (2024).

[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.

[8] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: T. Linzen, G. Chrupała, A. Alishahi (Eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belg... doi:10.18653/v1/w18-5446

[9] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology objects in context (roco): a multimodal image dataset, in: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3, Springer, 2018, pp. 180–189.

[10] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
[11] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (2017) 32–73.

[12] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, R. G. Mark, Mimic-iii, a freely accessible critical care database, Scientific data 3 (1) (2016) 1–9.

[13] D. Saxena, J. Cao, Generative adversarial networks (gans) challenges, solutions, and future directions, ACM Computing Surveys (CSUR) 54 (3) (2021) 1–42.

[14] L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. Santamaría, M. A. Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: concepts, cnn architectures, challenges, applications, future directions, Journal of big Data 8 (2021) 1–74.

[15] Y. Liu, P. Li, X. Hu, Combining context-relevant features with multi-stage attention network for short text classification, Computer Speech & Language 71 (2022) 101268.

[16] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).

[17] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017).

[18] A. Dosovitskiy, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).

[19] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, H. Hu, Video swin transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211.

[20] P. Xu, X. Zhu, D. A. Clifton, Multimodal learning with transformers: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (10) (2023) 12113–12132.
[21] S. S. Sengar, A. B. Hasan, S. Kumar, F. Carroll, Generative artificial intelligence: A systematic review and applications, arXiv preprint arXiv:2405.11029 (2024).

[22] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, arXiv preprint arXiv:2005.00928 (2020).

[23] Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, D. Zhu, AttCAT: Explaining Transformers via Attentive Class Activation Tokens.

[24] L. Parcalabescu, A. Frank, Mm-shap: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks, arXiv preprint arXiv:2212.08158 (2022).

[25] N. Rodis, C. Sardianos, P. Radoglou-Grammatikis, P. Sarigiannidis, I. Varlamis, G. T. Papadopoulos, Multimodal explainable artificial intelligence: A comprehensive review of methodological advances and future research directions, IEEE Access (2024).

[26] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608 (2017).

[27] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, in: 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), IEEE, 2018, pp. 80–89.

[28] P. Fantozzi, M. Naldi, The explainability of transformers: Current status and directions, Computers 13 (4) (2024) 92.

[29] S. Mohseni, N. Zarei, E. D. Ragan, A multidisciplinary survey and framework for design and evaluation of explainable ai systems, ACM Transactions on Interactive Intelligent Systems (TiiS) 11 (3-4) (2021) 1–45.

[30] B. Kitchenham, Procedures for performing systematic reviews, Keele, UK, Keele University 33 (2004) (2004) 1–26.
[31] D. Moher, A. Liberati, J. Tetzlaff, D. G. Altman, Preferred reporting items for systematic reviews and meta-analyses: the prisma statement, Bmj 339 (2009).

[32] S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in scientific writing: a friend or a foe?, Reproductive BioMedicine Online 47 (1) (2023) 3–9.

[33] J. Dagdelen, A. Dunn, S. Lee, N. Walker, A. S. Rosen, G. Ceder, K. A. Persson, A. Jain, Structured information extraction from scientific text with large language models, Nature Communications 15 (1) (2024) 1418.

[34] K. R. Felizardo, M. S. Lima, A. Deizepe, T. U. Conte, I. Steinmacher, Chatgpt application in systematic literature reviews in software engineering: an evaluation of its accuracy to support the selection activity, in: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2024, pp. 25–36.

[35] A. Huotala, M. Kuutila, P. Ralph, M. Mäntylä, The promise and challenges of using llms to accelerate the screening process of systematic reviews, in: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 262–271.

[36] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).

[37] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837.

[38] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th international conference on evaluation and assessment in software engineering, 2014, pp. 1–10.

[39] H. Chefer, S. Gur, L. Wolf, Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Montreal, QC, Canada, 2021, pp. 387–396. doi:10.1109/ICCV48922.2021.00045. URL https://ieeexplore.ieee.org/document/9710570/

[40] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. Van Keulen, C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai, ACM Computing Surveys 55 (13s) (2023) 1–42.
[41] Y. Yang, L. Jiao, F. Liu, X. Liu, L. Li, P. Chen, S. Yang, An Explainable Spatial–Frequency Multiscale Transformer for Remote Sensing Scene Classification, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–15. doi:10.1109/TGRS.2023.3265361. URL https://ieeexplore.ieee.org/document/10097579/

[42] Y. Huang, A. Jia, X. Zhang, J. Zhang, Generic Attention-model Explainability by Weighted Relevance Accumulation, in: ACM Multimedia Asia 2023, ACM, Tainan Taiwan, 2023, pp. 1–7. doi:10.1145/3595916.3626437. URL https://dl.acm.org/doi/10.1145/3595916.3626437

[43] Y. Guo, F. Cai, H. Chen, C. Chen, X. Zhang, M. Zhang, An Explainable Recommendation Method based on Diffusion Model, in: 2023 9th International Conference on Big Data and Information Analytics (BigDIA), IEEE, Haikou, China, 2023, pp. 802–806. doi:10.1109/BigDIA60676.2023.10429319. URL https://ieeexplore.ieee.org/document/10429319/

[44] Z. Liang, Y. Zhao, M. Surdeanu, Using the Hammer only on Nails: A Hybrid Method for Representation-Based Evidence Retrieval for Question Answering, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, Vol. 12656, Springer International Publishing, Cham, 2021, pp. 327–341, series Title: Le... doi:10.1007/978-3-030-72113-8_22

[45] H. Wang, Y. Gao, Y. Bai, M. Lapata, H. Huang, Exploring Explainable Selection to Control Abstractive Summarization, Proceedings of the AAAI Conference on Artificial Intelligence 35 (15) (2021) 13933–13941. doi:10.1609/aaai.v35i15.17641. URL https://ojs.aaai.org/index.php/AAAI/article/view/17641

[46] I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, J. Weill, N. Koenigstein, Interpreting BERT-based Text Similarity via Activation and Saliency Maps, arXiv:2208.06612 [cs] (Aug. 2022). doi:10.48550/arXiv.2208.06612. URL http://arxiv.org/abs/2208.06612

[47] J. Ferrando, M. R. Costa-jussà, Attention Weights in Transformer NMT Fail Aligning Words Between Sequences but Largely Explain Model Predictions, arXiv:2109.05853 [cs] (Sep. 2021). doi:10.48550/arXiv.2109.05853. URL http://arxiv.org/abs/2109.05853

[48] K. Zhang, L. Li, Explainable multimodal trajectory prediction using attention models, Transportation Research Part C: Emerging Technologies 143 (2022) 103829. doi:10.1016/j.trc.2022.103829. URL https://linkinghub.elsevier.com/retrieve/pii/S0968090X22002509

[49] M. Treviso, N. M. Guerreiro, R. Rei, A. F. T. Martins, IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task, in: Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 133–145. doi:10.18653/v1/2021.eval4nlp-1.14. URL https:...

[50] S. Wang, Q. Zeng, W. Ni, C. Cheng, Y. Wang, ODP-Transformer: Interpretation of pest classification results using image caption generation techniques, Computers and Electronics in Agriculture 209 (2023) 107863. doi:10.1016/j.compag.2023.107863. URL https://linkinghub.elsevier.com/retrieve/pii/S016816992300251X
[51]

J. Sun, S. Wang, J. Zhang, C. Zong, Neural Encoding and Decoding With Dis- tributed Sentence Representations, IEEE Transactions on Neural Networks and Learning Systems 32 (2) (2021) 589–603.doi:10.1109/TNNLS.2020.3027595. URLhttps://ieeexplore.ieee.org/document/9223750/

work page doi:10.1109/tnnls.2020.3027595 2021
[52]

URLhttps://linkinghub.elsevier.com/retrieve/pii/S0031320323003679 50

L.Yu, W.Xiang, J.Fang, Y.-P.P.Chen, L.Chi, eX-ViT:ANovelexplainablevision transformer for weakly supervised semantic segmentation, Pattern Recognition 142 (2023) 109666.doi:10.1016/j.patcog.2023.109666. URLhttps://linkinghub.elsevier.com/retrieve/pii/S0031320323003679 50

work page doi:10.1016/j.patcog.2023.109666 2023
[53]

Parelli, D

M. Parelli, D. Mallis, M. Diomataris, V. Pitsikalis, Interpretable Visual Question Answering via Reasoning Supervision, arXiv:2309.03726 [cs] (Sep. 2023).doi: 10.48550/arXiv.2309.03726. URLhttp://arxiv.org/abs/2309.03726

work page doi:10.48550/arxiv.2309.03726 2023
[54]

Aflalo, M

E. Aflalo, M. Du, S.-Y. Tseng, Y. Liu, C. Wu, N. Duan, V. Lal, VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, arXiv:2203.17247 [cs] (Aug. 2022).doi:10.48550/arXiv.2203.17247. URLhttp://arxiv.org/abs/2203.17247

work page doi:10.48550/arxiv.2203.17247 2022
[55]

S. Katz, Y. Belinkov, VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers, arXiv:2305.13417 [cs] (Nov. 2023).doi:10.48550/arXiv. 2305.13417. URLhttp://arxiv.org/abs/2305.13417

work page internal anchor Pith review doi:10.48550/arxiv 2023
[56]

X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y.Choi, J.Gao, Oscar: Object-SemanticsAlignedPre-trainingforVision-Language Tasks, arXiv:2004.06165 [cs] (Jul. 2020).doi:10.48550/arXiv.2004.06165. URLhttp://arxiv.org/abs/2004.06165

work page doi:10.48550/arxiv.2004.06165 2004
[57]

R. K. Kandukuri, J. Achterhold, M. Moeller, J. Stueckler, Physical Represen- tation Learning and Parameter Identification from Video Using Differentiable Physics, International Journal of Computer Vision 130 (1) (2022) 3–16.doi: 10.1007/s11263-021-01493-5. URLhttps://link.springer.com/10.1007/s11263-021-01493-5

work page doi:10.1007/s11263-021-01493-5 2022
[58]

W. Sun, C. Wang, H. Wu, Y. Miao, H. Zhu, W. Guo, J. Li, DFYOLOv5m- M2transformer: Interpretation of vegetable disease recognition results using image dense captioning techniques, Computers and Electronics in Agriculture 215 (2023) 108460.doi:10.1016/j.compag.2023.108460. URLhttps://linkinghub.elsevier.com/retrieve/pii/S0168169923008487

work page doi:10.1016/j.compag.2023.108460 2023
[59]

Rigotti, C

M. Rigotti, C. Miksovic, I. Giurgiu, T. Gschwind, P. Scotton, ATTENTION- BASED INTERPRETABILITY WITH CONCEPT TRANSFORMERS (2022)

work page 2022
[60]

Y. Heo, S. Kang, J. Seo, Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System, Sensors 23 (18) (2023) 7875. doi:10.3390/s23187875. URLhttps://www.mdpi.com/1424-8220/23/18/7875

work page doi:10.3390/s23187875 2023
[61]

URLhttps://linkinghub.elsevier.com/retrieve/pii/S0968090X23003480

J.Dong, S.Chen, M.Miralinaghi, T.Chen, P.Li, S.Labi, WhydidtheAImakethat decision? Towards an explainable artificial intelligence (XAI) for autonomous driv- ing systems, Transportation Research Part C: Emerging Technologies 156 (2023) 104358.doi:10.1016/j.trc.2023.104358. URLhttps://linkinghub.elsevier.com/retrieve/pii/S0968090X23003480

work page doi:10.1016/j.trc.2023.104358 2023
[62]

A. H. Mohammadkhani, C. Tantithamthavorn, H. Hemmatif, Explaining Transformer-based Code Models: What Do They Learn? When They Do Not Work?, in: 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM), IEEE, Bogotá, Colombia, 2023, pp. 96–106. doi:10.1109/SCAM59687.2023.00020. URLhttps://ieeexplore.ieee.org/document...

work page doi:10.1109/scam59687.2023.00020 2023
[63]

R. Buoy, M. Iwamura, S. Srun, K. Kise, Explainable Connectionist-Temporal- Classification-Based Scene Text Recognition, Journal of Imaging 9 (11) (2023) 248. doi:10.3390/jimaging9110248. URLhttps://www.mdpi.com/2313-433X/9/11/248

work page doi:10.3390/jimaging9110248 2023
[64]

M. Z. Boito, A. Villavicencio, L. Besacier, Investigating alignment interpretability for low-resource NMT, Machine Translation 34 (4) (2020) 305–323.doi:10.1007/ s10590-020-09254-w. URLhttp://link.springer.com/10.1007/s10590-020-09254-w

work page doi:10.1007/s10590-020-09254-w 2020
[65]

T. Chen, S. Liu, Z. Chen, W. Hu, D. Chen, Y. Wang, Q. Lyu, C. X. Le, W. Wang, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, Advances in Artificial Intelligence and Machine Learning 03 (03) (2023) 1369–1388.doi:10.54364/AAIML.2023.1181. URLhttps://www.oajaiml.com/uploads/archivepdf/50081181.pdf

work page doi:10.54364/aaiml.2023.1181 2023
[66]

Jaitly, Q

N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, S. Bengio, An online sequence-to-sequence model using partial conditioning, Advances in neural infor- mation processing systems 29 (2016)

work page 2016
[67]

attention

R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of" attention" in sequence-to-sequence models., in: Interspeech, 2017, pp. 3702–3706

work page 2017
[68]

Y. Meng, W. Speier, M. K. Ong, C. W. Arnold, Bidirectional Representation Learn- ing From Transformers Using Multimodal Electronic Health Record Data to Predict Depression, IEEE Journal of Biomedical and Health Informatics 25 (8) (2021) 3121– 3129.doi:10.1109/JBHI.2021.3063721. URLhttps://ieeexplore.ieee.org/document/9369833/

work page doi:10.1109/jbhi.2021.3063721 2021
[69]

D. Lee, C. H. Suh, J. Kim, W. Jung, C. Park, K.-H. Jung, S. T. Kong, W. H. Shim, H. Heo, S. J. Kim, Augmenting Magnetic Resonance Imaging with Tabular Fea- tures for Enhanced and Interpretable Medial Temporal Lobe Atrophy Prediction, in: A. Abdulkadir, D. R. Bathula, N. C. Dvornek, M. Habes, S. M. Kia, V. Ku- mar, T. Wolfers (Eds.), Machine Learning in Cl...

work page doi:10.1007/978-3-031-17899-3_13 2022
[70]

C. C. Ukwuoma, Z. Qin, M. Belal Bin Heyat, F. Akhtar, O. Bamisile, A. Y. Muaad, D. Addo, M. A. Al-antari, A hybrid explainable ensemble transformer encoder for pneumonia identification from chest X-ray images, Journal of Advanced Research 48 (2023) 191–211.doi:10.1016/j.jare.2022.08.021. URLhttps://linkinghub.elsevier.com/retrieve/pii/S2090123222002028

work page doi:10.1016/j.jare.2022.08.021 2023
[71]

Chiewhawan, P

T. Chiewhawan, P. Vateekul, Explainable Deep Learning for Thai Stock Market Prediction Using Textual Representation and Technical Indicators, in: Proceed- ings of the 8th International Conference on Computer and Communications Man- agement, ACM, Singapore Singapore, 2020, pp. 19–23.doi:10.1145/3411174. 3411191. 52

work page doi:10.1145/3411174 2020
[72]

URLhttps://link.springer.com/10.1007/s10489-022-04254-0

B.Wu, L.Wang, Y.-R.Zeng, Interpretabletourismdemandforecastingwithtempo- ral fusion transformers amid COVID-19, Applied Intelligence 53 (11) (2023) 14493– 14514.doi:10.1007/s10489-022-04254-0. URLhttps://link.springer.com/10.1007/s10489-022-04254-0

work page doi:10.1007/s10489-022-04254-0 2023
[73]

D. Wang, W. Li, X. Dong, H. Li, L. Hu, TFRegNCI: Interpretable Noncovalent Interaction Correction Multimodal Based on Transformer Encoder Fusion, Journal of Chemical Information and Modeling 63 (3) (2023) 782–793.doi:10.1021/acs. jcim.2c01283. URLhttps://pubs.acs.org/doi/10.1021/acs.jcim.2c01283

work page doi:10.1021/acs 2023
[74]

Naseem, M

U. Naseem, M. Khushi, J. Kim, Vision-Language Transformer for Interpretable Pathology Visual Question Answering, IEEE Journal of Biomedical and Health Informatics 27 (4) (2023) 1681–1690.doi:10.1109/JBHI.2022.3163751. URLhttps://ieeexplore.ieee.org/document/9745795/

work page doi:10.1109/jbhi.2022.3163751 2023
[75]

S. Xu, W. Zhang, F. Zhang, Multi-Granular BERT: An Interpretable Model Ap- plicable to Internet-of-Thing devices, in: 2020 IEEE International Conference on Energy Internet (ICEI), IEEE, Sydney, NSW, Australia, 2020, pp. 134–139. doi:10.1109/ICEI49372.2020.00032. URLhttps://ieeexplore.ieee.org/document/9270262/

work page doi:10.1109/icei49372.2020.00032 2020
[76]

Janssens, L

B. Janssens, L. Schetgen, M. Bogaert, M. Meire, D. Van Den Poel, 360 Degrees rumor detection: When explanations got some explaining to do, European Journal of Operational Research 317 (2) (2024) 366–381.doi:10.1016/j.ejor.2023.06. 024. URLhttps://linkinghub.elsevier.com/retrieve/pii/S0377221723004769

work page doi:10.1016/j.ejor.2023.06 2024
[77]

P. Ding, Y. Wang, X. Zhang, X. Gao, G. Liu, B. Yu, DeepSTF: predicting transcription factor binding sites by interpretable deep neural networks com- bining sequence and shape, Briefings in Bioinformatics 24 (4) (2023) bbad231. doi:10.1093/bib/bbad231. URLhttps://academic.oup.com/bib/article/doi/10.1093/bib/bbad231/ 7199560

work page doi:10.1093/bib/bbad231 2023
[78]

Feucht, Z

M. Feucht, Z. Wu, S. Althammer, V. Tresp, Description-based Label Attention Classifier for Explainable ICD-9 Classification, arXiv:2109.12026 [cs] (Sep. 2021). doi:10.48550/arXiv.2109.12026. URLhttp://arxiv.org/abs/2109.12026

work page doi:10.48550/arxiv.2109.12026 2021
[79]

Kumar, V

P. Kumar, V. Kaushik, B. Raman, Towards the Explainability of Multimodal Speech Emotion Recognition, in: Interspeech 2021, ISCA, 2021, pp. 1748–1752. doi:10.21437/Interspeech.2021-1718. URLhttps://www.isca-archive.org/interspeech_2021/kumar21d_ interspeech.html

work page doi:10.21437/interspeech.2021-1718 2021
[80]

Ullah, A

F. Ullah, A. Alsirhani, M. M. Alshahrani, A. Alomari, H. Naeem, S. A. Shah, Explainable Malware Detection System Using Transformers-Based Transfer Learn- ing and Multi-Model Visual Representation, Sensors 22 (18) (2022) 6766.doi: 10.3390/s22186766. 53

work page doi:10.3390/s22186766 2022

Showing first 80 references.

[1] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (5) (2018) 1–42

[2] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai, Information fusion 58 (2020) 82–115

[3] N. Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317

[4] G. Yang, Q. Ye, J. Xia, Unbox the black-box for the medical explainable ai via multi-modal and multi-centre data fusion: A mini-review, two showcases and beyond, Information Fusion 77 (2022) 29–52

[5] L. Nannini, A. Balayn, A. L. Smith, Explainability in ai policies: A critical review of communications, reports, regulations, and standards in the eu, us, and uk, in: Proceedings of the 2023 ACM conference on fairness, accountability, and transparency, 2023, pp. 1198–1212

[6] J. Chun, C. S. de Witt, K. Elkins, Comparative global ai regulation: Policy perspectives from the eu, china, and the us, arXiv preprint arXiv:2410.21279 (2024)

[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.

[8] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: T. Linzen, G. Chrupała, A. Alishahi (Eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belg...

[9] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology objects in context (roco): a multimodal image dataset, in: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3, Springer, 2018, pp. 180–189

[10] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433

[11] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (2017) 32–73

[12] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, R. G. Mark, Mimic-iii, a freely accessible critical care database, Scientific data 3 (1) (2016) 1–9

[13] D. Saxena, J. Cao, Generative adversarial networks (gans) challenges, solutions, and future directions, ACM Computing Surveys (CSUR) 54 (3) (2021) 1–42

[14] L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. Santamaría, M. A. Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: concepts, cnn architectures, challenges, applications, future directions, Journal of big Data 8 (2021) 1–74

[15] Y. Liu, P. Li, X. Hu, Combining context-relevant features with multi-stage attention network for short text classification, Computer Speech & Language 71 (2022) 101268

[16] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014)

[17] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017)

[18] A. Dosovitskiy, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

[19] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, H. Hu, Video swin transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211.

[20] P. Xu, X. Zhu, D. A. Clifton, Multimodal learning with transformers: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (10) (2023) 12113–12132

[21] S. S. Sengar, A. B. Hasan, S. Kumar, F. Carroll, Generative artificial intelligence: A systematic review and applications, arXiv preprint arXiv:2405.11029 (2024)

[22] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, arXiv preprint arXiv:2005.00928 (2020)

[23] Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, D. Zhu, AttCAT: Explaining Transformers via Attentive Class Activation Tokens

[24] L. Parcalabescu, A. Frank, Mm-shap: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks, arXiv preprint arXiv:2212.08158 (2022)

[25] N. Rodis, C. Sardianos, P. Radoglou-Grammatikis, P. Sarigiannidis, I. Varlamis, G. T. Papadopoulos, Multimodal explainable artificial intelligence: A comprehensive review of methodological advances and future research directions, IEEE Access (2024)

[26] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608 (2017)

[27] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, in: 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), IEEE, 2018, pp. 80–89

[28] P. Fantozzi, M. Naldi, The explainability of transformers: Current status and directions, Computers 13 (4) (2024) 92

[29] S. Mohseni, N. Zarei, E. D. Ragan, A multidisciplinary survey and framework for design and evaluation of explainable ai systems, ACM Transactions on Interactive Intelligent Systems (TiiS) 11 (3-4) (2021) 1–45

[30] B. Kitchenham, Procedures for performing systematic reviews, Keele, UK, Keele University 33 (2004) (2004) 1–26

[31] D. Moher, A. Liberati, J. Tetzlaff, D. G. Altman, Preferred reporting items for systematic reviews and meta-analyses: the prisma statement, Bmj 339 (2009)

[32] S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in scientific writing: a friend or a foe?, Reproductive BioMedicine Online 47 (1) (2023) 3–9

[33] J. Dagdelen, A. Dunn, S. Lee, N. Walker, A. S. Rosen, G. Ceder, K. A. Persson, A. Jain, Structured information extraction from scientific text with large language models, Nature Communications 15 (1) (2024) 1418.

[34] K. R. Felizardo, M. S. Lima, A. Deizepe, T. U. Conte, I. Steinmacher, Chatgpt application in systematic literature reviews in software engineering: an evaluation of its accuracy to support the selection activity, in: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2024, pp. 25–36

[35] A. Huotala, M. Kuutila, P. Ralph, M. Mäntylä, The promise and challenges of using llms to accelerate the screening process of systematic reviews, in: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 262–271

[36] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024)

[37] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837

[38] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th international conference on evaluation and assessment in software engineering, 2014, pp. 1–10

[39] H. Chefer, S. Gur, L. Wolf, Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Montreal, QC, Canada, 2021, pp. 387–396. doi:10.1109/ICCV48922.2021.00045. URL https://ieeexplore.ieee.org/document/9710570/

[40] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. Van Keulen, C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai, ACM Computing Surveys 55 (13s) (2023) 1–42

[41] Y. Yang, L. Jiao, F. Liu, X. Liu, L. Li, P. Chen, S. Yang, An Explainable Spatial–Frequency Multiscale Transformer for Remote Sensing Scene Classification, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–15. doi:10.1109/TGRS.2023.3265361. URL https://ieeexplore.ieee.org/document/10097579/

[42] Y. Huang, A. Jia, X. Zhang, J. Zhang, Generic Attention-model Explainability by Weighted Relevance Accumulation, in: ACM Multimedia Asia 2023, ACM, Tainan, Taiwan, 2023, pp. 1–7. doi:10.1145/3595916.3626437. URL https://dl.acm.org/doi/10.1145/3595916.3626437

[43] Y. Guo, F. Cai, H. Chen, C. Chen, X. Zhang, M. Zhang, An Explainable Recommendation Method based on Diffusion Model, in: 2023 9th International Conference on Big Data and Information Analytics (BigDIA), IEEE, Haikou, China, 2023, pp. 802–806. doi:10.1109/BigDIA60676.2023.10429319. URL https://ieeexplore.ieee.org/document/10429319/

[44] Z. Liang, Y. Zhao, M. Surdeanu, Using the Hammer only on Nails: A Hybrid Method for Representation-Based Evidence Retrieval for Question Answering, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, Vol. 12656, Springer International Publishing, Cham, 2021, pp. 327–341, series Title: Le...

[45] H. Wang, Y. Gao, Y. Bai, M. Lapata, H. Huang, Exploring Explainable Selection to Control Abstractive Summarization, Proceedings of the AAAI Conference on Artificial Intelligence 35 (15) (2021) 13933–13941. doi:10.1609/aaai.v35i15.17641. URL https://ojs.aaai.org/index.php/AAAI/article/view/17641

[46] I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, J. Weill, N. Koenigstein, Interpreting BERT-based Text Similarity via Activation and Saliency Maps, arXiv:2208.06612 [cs] (Aug. 2022). doi:10.48550/arXiv.2208.06612. URL http://arxiv.org/abs/2208.06612

[47] J. Ferrando, M. R. Costa-jussà, Attention Weights in Transformer NMT Fail Aligning Words Between Sequences but Largely Explain Model Predictions, arXiv:2109.05853 [cs] (Sep. 2021). doi:10.48550/arXiv.2109.05853. URL http://arxiv.org/abs/2109.05853

[48] K. Zhang, L. Li, Explainable multimodal trajectory prediction using attention models, Transportation Research Part C: Emerging Technologies 143 (2022) 103829. doi:10.1016/j.trc.2022.103829. URL https://linkinghub.elsevier.com/retrieve/pii/S0968090X22002509

[49] M. Treviso, N. M. Guerreiro, R. Rei, A. F. T. Martins, IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task, in: Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 133–145. doi:10.18653/v1/2021.eval4nlp-1.14. URL https:...

[50] S. Wang, Q. Zeng, W. Ni, C. Cheng, Y. Wang, ODP-Transformer: Interpretation of pest classification results using image caption generation techniques, Computers and Electronics in Agriculture 209 (2023) 107863. doi:10.1016/j.compag.2023.107863. URL https://linkinghub.elsevier.com/retrieve/pii/S016816992300251X

[51] J. Sun, S. Wang, J. Zhang, C. Zong, Neural Encoding and Decoding With Distributed Sentence Representations, IEEE Transactions on Neural Networks and Learning Systems 32 (2) (2021) 589–603. doi:10.1109/TNNLS.2020.3027595. URL https://ieeexplore.ieee.org/document/9223750/

[52] L. Yu, W. Xiang, J. Fang, Y.-P. P. Chen, L. Chi, eX-ViT: A Novel explainable vision transformer for weakly supervised semantic segmentation, Pattern Recognition 142 (2023) 109666. doi:10.1016/j.patcog.2023.109666. URL https://linkinghub.elsevier.com/retrieve/pii/S0031320323003679

[53] M. Parelli, D. Mallis, M. Diomataris, V. Pitsikalis, Interpretable Visual Question Answering via Reasoning Supervision, arXiv:2309.03726 [cs] (Sep. 2023). doi:10.48550/arXiv.2309.03726. URL http://arxiv.org/abs/2309.03726

[54] E. Aflalo, M. Du, S.-Y. Tseng, Y. Liu, C. Wu, N. Duan, V. Lal, VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, arXiv:2203.17247 [cs] (Aug. 2022). doi:10.48550/arXiv.2203.17247. URL http://arxiv.org/abs/2203.17247

[55] S. Katz, Y. Belinkov, VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers, arXiv:2305.13417 [cs] (Nov. 2023). doi:10.48550/arXiv.2305.13417. URL http://arxiv.org/abs/2305.13417

[56] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, J. Gao, Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, arXiv:2004.06165 [cs] (Jul. 2020). doi:10.48550/arXiv.2004.06165. URL http://arxiv.org/abs/2004.06165

[57] R. K. Kandukuri, J. Achterhold, M. Moeller, J. Stueckler, Physical Representation Learning and Parameter Identification from Video Using Differentiable Physics, International Journal of Computer Vision 130 (1) (2022) 3–16. doi:10.1007/s11263-021-01493-5. URL https://link.springer.com/10.1007/s11263-021-01493-5

[58] W. Sun, C. Wang, H. Wu, Y. Miao, H. Zhu, W. Guo, J. Li, DFYOLOv5m-M2transformer: Interpretation of vegetable disease recognition results using image dense captioning techniques, Computers and Electronics in Agriculture 215 (2023) 108460. doi:10.1016/j.compag.2023.108460. URL https://linkinghub.elsevier.com/retrieve/pii/S0168169923008487

[59] M. Rigotti, C. Miksovic, I. Giurgiu, T. Gschwind, P. Scotton, Attention-Based Interpretability with Concept Transformers (2022)

[60] Y. Heo, S. Kang, J. Seo, Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System, Sensors 23 (18) (2023) 7875. doi:10.3390/s23187875. URL https://www.mdpi.com/1424-8220/23/18/7875

[61] J. Dong, S. Chen, M. Miralinaghi, T. Chen, P. Li, S. Labi, Why did the AI make that decision? Towards an explainable artificial intelligence (XAI) for autonomous driving systems, Transportation Research Part C: Emerging Technologies 156 (2023) 104358. doi:10.1016/j.trc.2023.104358. URL https://linkinghub.elsevier.com/retrieve/pii/S0968090X23003480

[62] A. H. Mohammadkhani, C. Tantithamthavorn, H. Hemmati, Explaining Transformer-based Code Models: What Do They Learn? When They Do Not Work?, in: 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM), IEEE, Bogotá, Colombia, 2023, pp. 96–106. doi:10.1109/SCAM59687.2023.00020. URL https://ieeexplore.ieee.org/document...

[63] R. Buoy, M. Iwamura, S. Srun, K. Kise, Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition, Journal of Imaging 9 (11) (2023) 248. doi:10.3390/jimaging9110248. URL https://www.mdpi.com/2313-433X/9/11/248

[64] M. Z. Boito, A. Villavicencio, L. Besacier, Investigating alignment interpretability for low-resource NMT, Machine Translation 34 (4) (2020) 305–323. doi:10.1007/s10590-020-09254-w. URL http://link.springer.com/10.1007/s10590-020-09254-w

[65] T. Chen, S. Liu, Z. Chen, W. Hu, D. Chen, Y. Wang, Q. Lyu, C. X. Le, W. Wang, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, Advances in Artificial Intelligence and Machine Learning 03 (03) (2023) 1369–1388. doi:10.54364/AAIML.2023.1181. URL https://www.oajaiml.com/uploads/archivepdf/50081181.pdf

[66] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, S. Bengio, An online sequence-to-sequence model using partial conditioning, Advances in neural information processing systems 29 (2016)

[67] R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of "attention" in sequence-to-sequence models, in: Interspeech, 2017, pp. 3702–3706

[68] Y. Meng, W. Speier, M. K. Ong, C. W. Arnold, Bidirectional Representation Learning From Transformers Using Multimodal Electronic Health Record Data to Predict Depression, IEEE Journal of Biomedical and Health Informatics 25 (8) (2021) 3121–3129. doi:10.1109/JBHI.2021.3063721. URL https://ieeexplore.ieee.org/document/9369833/

[69] D. Lee, C. H. Suh, J. Kim, W. Jung, C. Park, K.-H. Jung, S. T. Kong, W. H. Shim, H. Heo, S. J. Kim, Augmenting Magnetic Resonance Imaging with Tabular Features for Enhanced and Interpretable Medial Temporal Lobe Atrophy Prediction, in: A. Abdulkadir, D. R. Bathula, N. C. Dvornek, M. Habes, S. M. Kia, V. Kumar, T. Wolfers (Eds.), Machine Learning in Cl...

[70] C. C. Ukwuoma, Z. Qin, M. Belal Bin Heyat, F. Akhtar, O. Bamisile, A. Y. Muaad, D. Addo, M. A. Al-antari, A hybrid explainable ensemble transformer encoder for pneumonia identification from chest X-ray images, Journal of Advanced Research 48 (2023) 191–211. doi:10.1016/j.jare.2022.08.021. URL https://linkinghub.elsevier.com/retrieve/pii/S2090123222002028

[71] T. Chiewhawan, P. Vateekul, Explainable Deep Learning for Thai Stock Market Prediction Using Textual Representation and Technical Indicators, in: Proceedings of the 8th International Conference on Computer and Communications Management, ACM, Singapore, 2020, pp. 19–23. doi:10.1145/3411174.3411191.

[72] B. Wu, L. Wang, Y.-R. Zeng, Interpretable tourism demand forecasting with temporal fusion transformers amid COVID-19, Applied Intelligence 53 (11) (2023) 14493–14514. doi:10.1007/s10489-022-04254-0. URL https://link.springer.com/10.1007/s10489-022-04254-0

[73] D. Wang, W. Li, X. Dong, H. Li, L. Hu, TFRegNCI: Interpretable Noncovalent Interaction Correction Multimodal Based on Transformer Encoder Fusion, Journal of Chemical Information and Modeling 63 (3) (2023) 782–793. doi:10.1021/acs.jcim.2c01283. URL https://pubs.acs.org/doi/10.1021/acs.jcim.2c01283

[74] [74]

Naseem, M

U. Naseem, M. Khushi, J. Kim, Vision-Language Transformer for Interpretable Pathology Visual Question Answering, IEEE Journal of Biomedical and Health Informatics 27 (4) (2023) 1681–1690.doi:10.1109/JBHI.2022.3163751. URLhttps://ieeexplore.ieee.org/document/9745795/


[75]

S. Xu, W. Zhang, F. Zhang, Multi-Granular BERT: An Interpretable Model Applicable to Internet-of-Thing devices, in: 2020 IEEE International Conference on Energy Internet (ICEI), IEEE, Sydney, NSW, Australia, 2020, pp. 134–139. doi:10.1109/ICEI49372.2020.00032. URL https://ieeexplore.ieee.org/document/9270262/


[76]


B. Janssens, L. Schetgen, M. Bogaert, M. Meire, D. Van Den Poel, 360 Degrees rumor detection: When explanations got some explaining to do, European Journal of Operational Research 317 (2) (2024) 366–381. doi:10.1016/j.ejor.2023.06.024. URL https://linkinghub.elsevier.com/retrieve/pii/S0377221723004769


[77]

P. Ding, Y. Wang, X. Zhang, X. Gao, G. Liu, B. Yu, DeepSTF: predicting transcription factor binding sites by interpretable deep neural networks combining sequence and shape, Briefings in Bioinformatics 24 (4) (2023) bbad231. doi:10.1093/bib/bbad231. URL https://academic.oup.com/bib/article/doi/10.1093/bib/bbad231/7199560


[78]


M. Feucht, Z. Wu, S. Althammer, V. Tresp, Description-based Label Attention Classifier for Explainable ICD-9 Classification, arXiv:2109.12026 [cs] (Sep. 2021). doi:10.48550/arXiv.2109.12026. URL http://arxiv.org/abs/2109.12026


[79]


P. Kumar, V. Kaushik, B. Raman, Towards the Explainability of Multimodal Speech Emotion Recognition, in: Interspeech 2021, ISCA, 2021, pp. 1748–1752. doi:10.21437/Interspeech.2021-1718. URL https://www.isca-archive.org/interspeech_2021/kumar21d_interspeech.html


[80]


F. Ullah, A. Alsirhani, M. M. Alshahrani, A. Alomari, H. Naeem, S. A. Shah, Explainable Malware Detection System Using Transformers-Based Transfer Learning and Multi-Model Visual Representation, Sensors 22 (18) (2022) 6766. doi:10.3390/s22186766.
