
arxiv: 2508.04427 · v1 · submitted 2025-08-06 · 💻 cs.LG · cs.AI

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Pith reviewed 2026-05-19 00:30 UTC · model grok-4.3

keywords explainable AI, multimodal learning, attention mechanisms, systematic review, evaluation methodologies, vision-language models, XAI

The pith

Evaluation methods for XAI in multimodal attention models are inconsistent and overlook modality-specific factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This systematic review examines research published from January 2020 to early 2024 on explainability techniques for multimodal attention-based models. It organizes the literature by model architecture, input modalities, explanation algorithms, and evaluation approaches. The analysis finds heavy concentration on vision-language and language-only models that rely on attention mechanisms, yet these techniques rarely capture complete cross-modal interactions. The central observation is that current evaluation methods remain non-systematic, lack consistent metrics, and ignore cognitive or contextual differences tied to each modality. Readers would care because weak evaluation practices make it difficult to verify whether explanations actually support trustworthy decisions in applied multimodal systems.

Core claim

The paper establishes that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Studies predominantly address vision-language models and employ attention-based explanation algorithms, but these approaches fall short of capturing the full range of modality interactions because of architectural heterogeneity across domains. The authors respond with recommendations for rigorous, transparent, and standardized evaluation and reporting practices to advance more interpretable and accountable multimodal AI.

What carries the argument

Multi-dimensional analysis of the literature across model architecture, modalities, explanation algorithms, and evaluation methodologies.

If this is right

  • Standardized evaluation protocols would increase consistency when comparing XAI techniques across multimodal tasks.
  • Explanation algorithms would need to be redesigned to capture interactions between modalities more completely.
  • Evaluations would incorporate modality-specific cognitive and contextual factors as a required component.
  • Transparent reporting standards would improve accountability in multimodal AI development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Inconsistent evaluations could delay safe adoption of multimodal models in domains that require high trust, such as medical imaging or autonomous navigation.
  • The observed emphasis on vision-language pairs leaves explainability in other combinations, like audio-text or sensor fusion, relatively unexamined.
  • Following the recommendations could enable direct benchmarking of different XAI approaches and speed cumulative progress in the area.

Load-bearing premise

The literature search from January 2020 to early 2024 and the chosen analysis dimensions capture a representative and unbiased sample of research in the field.

What would settle it

A new survey that identifies multiple studies sharing the same consistent, robust evaluation protocol that explicitly incorporates modality-specific cognitive and contextual factors would challenge the claim.

Figures

Figures reproduced from arXiv: 2508.04427 by Janan Arslan, Md Raisul Kibria, Sébastien Lafond.

Figure 1: PRISMA flowchart for the number of selected studies during the different steps of the selection
Figure 2: Publication per year were grouped into themes defined by application domains and tasks. Although each study was assigned to a single theme to streamline the analysis, it is important to note that both the themes and assigned studies are not strictly mutually exclusive and may span multiple disciplines (e.g., Natural Language Processing (NLP) and Translation vs. Question Answering and Summarization). The th…
Figure 3: Key Bibliometric Analytics of the Publications
Figure 4: Representation of modalities. The least amount of work is targeted towards code modeling. These findings are aligned with other surveys on XAI (e.g., [40]). An important decision regarding the eligibility criteria in our study is the inclusion of "multichannel" modeling approaches. These criteria cover models that make decisions based on multiple inputs generated by processing the same source input. For ins…
Figure 5: Distribution of multimodal/generative and multichannel modeling approaches
Figure 6: Primary training task objectives
Figure 7: Block diagram illustrating various fusion architecture types: Early fusion (a, b); Hierarchical
Figure 8: Classification and distribution of explanation algorithms
Figure 9: Decoder Layer Contribution Matrix in ALTI+ Method [87]
Figure 10: Initially, the unimodal relevance scores are initialized as identity matrices, while bi-modal (cross-modal) relevance maps are initialized with zeros. The attention map update
Figure 10: Visual explanation using the multimodal attention-composite method by Chefer et al. for ob
Figure 11: SHAP values to explain pathology images and corresponding questions in VQA tasks [74].
Figure 12: Categories of Explanation Evaluation metrics and distribution
Figure 13: Flow-graph for layer 10 attention from GPT-2 small in VISIT for an IOI task [55].
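The Figure 10 excerpt describes how relevance maps are initialized for a two-modality transformer: identity matrices for the unimodal maps and zeros for the cross-modal ones. The sketch below illustrates that initialization with NumPy. The function names, the text/image naming, and the self-attention update rule (head-averaged positive part of gradient times attention, accumulated into the maps) are assumptions for illustration; the update is recalled from the general approach of Chefer et al. [39] and is not spelled out on this page.

```python
import numpy as np

def init_relevance(n_text, n_image):
    """Initial relevance maps for a two-modality (text, image) transformer."""
    return {
        "tt": np.eye(n_text),                # text -> text (unimodal): identity
        "ii": np.eye(n_image),               # image -> image (unimodal): identity
        "ti": np.zeros((n_text, n_image)),   # text -> image (cross-modal): zeros
        "it": np.zeros((n_image, n_text)),   # image -> text (cross-modal): zeros
    }

def head_averaged_attention(attn, grad):
    """Assumed rule: mean over heads of the positive part of grad * attn.

    attn, grad: arrays of shape (heads, n_tokens, n_tokens).
    """
    return np.clip(grad * attn, 0.0, None).mean(axis=0)

def self_attention_update(R, attn, grad, modality="t"):
    """Assumed rule: accumulate one self-attention layer of one modality."""
    other = "i" if modality == "t" else "t"
    a_bar = head_averaged_attention(attn, grad)
    same, cross = modality * 2, modality + other
    R[cross] = R[cross] + a_bar @ R[cross]   # stays zero until cross-attention mixes modalities
    R[same] = R[same] + a_bar @ R[same]
    return R

# Usage with hypothetical sizes: 3 text tokens, 2 image tokens, 2 heads.
R = init_relevance(3, 2)
attn = np.full((2, 3, 3), 1.0 / 3.0)   # uniform attention, for illustration only
grad = np.ones((2, 3, 3))              # placeholder gradients
R = self_attention_update(R, attn, grad, modality="t")
```

With zero-initialized cross-modal maps, a self-attention layer alone leaves them at zero in this sketch; only a cross-attention step (not shown) would populate them.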

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible mulitmodal AI systems, with explainability at their core.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a systematic literature review of explainability research on multimodal attention-based models, covering publications from January 2020 to early 2024. It analyzes the selected works along the dimensions of model architecture, involved modalities, explanation algorithms, and evaluation methodologies. The central finding is that evaluation methods in this area are largely non-systematic, lacking consistency, robustness, and attention to modality-specific cognitive or contextual factors; the paper concludes with a set of recommendations for more rigorous and standardized practices.

Significance. If the sampled literature accurately reflects the field, the review usefully documents gaps in evaluation rigor for multimodal XAI and supplies concrete recommendations that could help standardize future work. The multi-dimensional framing (architecture, modalities, algorithms, evaluation) is a constructive organizing device for the synthesis.

major comments (1)
  1. [Section 2] Section 2 (Literature Search and Selection Criteria): The description of the search strategy, databases, exact Boolean strings, and inclusion/exclusion criteria is not detailed enough to allow an independent assessment of coverage. This directly affects the load-bearing claim that evaluation methods are 'largely non-systematic' across multimodal XAI, because an under-sampling of non-attention-based or non-vision-language work (e.g., audio-sensor or time-series multimodal models) could artifactually produce the observed pattern.
minor comments (2)
  1. [Abstract] Abstract: 'mulitmodal' is a typographical error and should read 'multimodal'.
  2. Throughout the results sections, some tables summarizing evaluation metrics would benefit from an additional column or footnote clarifying whether the reported metrics are quantitative, qualitative, or human-subject based, to improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our systematic literature review. We address the major comment point by point below, providing the strongest honest defense of the manuscript while agreeing where revisions are warranted to improve transparency and reproducibility.

  1. Referee: [Section 2] Section 2 (Literature Search and Selection Criteria): The description of the search strategy, databases, exact Boolean strings, and inclusion/exclusion criteria is not detailed enough to allow an independent assessment of coverage. This directly affects the load-bearing claim that evaluation methods are 'largely non-systematic' across multimodal XAI, because an under-sampling of non-attention-based or non-vision-language work (e.g., audio-sensor or time-series multimodal models) could artifactually produce the observed pattern.

    Authors: We thank the referee for this observation. We agree that greater specificity in Section 2 is needed to support independent verification of our search coverage and to strengthen the foundation for our central claim. In the revised manuscript we will expand this section to include: the complete list of databases (IEEE Xplore, ACM Digital Library, Scopus, Web of Science, arXiv, and Google Scholar), the exact Boolean search strings employed (combinations of terms such as 'multimodal attention' AND ('explainability' OR 'XAI' OR 'interpretability') AND modality-specific keywords), the full inclusion/exclusion criteria with justifications, the number of records at each screening stage, and a PRISMA flow diagram. Regarding potential under-sampling, our protocol was deliberately scoped to attention-based multimodal models with explainability components; while vision-language tasks dominate the retrieved literature, we did include qualifying studies involving audio, sensor, and time-series modalities. We will add an explicit limitations subsection discussing the observed distribution across modalities and its implications for generalizing the evaluation-method findings. These changes will make the sampling transparent and allow readers to assess whether the reported patterns in evaluation rigor are representative of the sampled corpus. revision: yes
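The Boolean structure quoted in the response ('multimodal attention' AND one of several explainability terms AND a modality keyword) can be made concrete with a small screening filter. This is a minimal sketch, not the authors' search tooling: the function names are hypothetical, and the modality keyword list is an assumption, since the response only says "modality-specific keywords" without listing them.

```python
import re

# Terms taken from the example in the response above.
CORE_PHRASE = "multimodal attention"
XAI_TERMS = ("explainability", "xai", "interpretability")
# Assumed placeholder list; the actual modality keywords are not given.
MODALITY_TERMS = ("vision", "image", "audio", "sensor", "time-series", "text")

def _has(term, text):
    """True if term occurs in text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

def matches_query(text):
    """Evaluate: CORE_PHRASE AND (any XAI term) AND (any modality term)."""
    text = text.lower()
    return (
        _has(CORE_PHRASE, text)
        and any(_has(t, text) for t in XAI_TERMS)
        and any(_has(t, text) for t in MODALITY_TERMS)
    )

# Usage on hypothetical record titles.
records = [
    "Explainability of multimodal attention for audio and text fusion",
    "Multimodal attention for image captioning",
]
selected = [r for r in records if matches_query(r)]
```

Publishing the term lists in this explicit form is one way to let readers rerun the screening and check coverage of under-represented modalities.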

Circularity Check

0 steps flagged

No circularity: systematic literature review without derivations or self-referential predictions


This paper is a systematic literature review that synthesizes published work on explainability in multimodal attention-based models from January 2020 to early 2024. It examines dimensions such as architecture, modalities, explanation algorithms, and evaluation methodologies but contains no equations, fitted parameters, predictions, or derivation chains. The central claim about non-systematic evaluation methods follows directly from the authors' analysis of sampled papers rather than reducing to any self-definition, fitted input, or self-citation load-bearing step. The work is self-contained as an external synthesis with no internal circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The review rests on standard assumptions about literature coverage and categorization without introducing free parameters or new entities.

axioms (1)
  • domain assumption: The chosen time period and focus on attention-based multimodal models capture the relevant body of explainability research.
    This premise defines the scope of the systematic review as described in the abstract.

pith-pipeline@v0.9.0 · 5797 in / 1112 out tokens · 57366 ms · 2026-05-19T00:30:03.785396+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities... evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Table 6: Architecture Variants of Multimodal Attention-based Models... Early Summation, Early Concatenation, Hierarchical Multi-to-One, Single Cross-Attention Branch, Multi-Cross Attention

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

132 extracted references · 132 canonical work pages · 8 internal anchors

  1. [1]

    R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (5) (2018) 1–42

  2. [2]

    A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai, Information fusion 58 (2020) 82–115

  3. [3]

    N. Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317

  4. [4]

    G. Yang, Q. Ye, J. Xia, Unbox the black-box for the medical explainable ai via multi-modal and multi-centre data fusion: A mini-review, two showcases and beyond, Information Fusion 77 (2022) 29–52

  5. [5]

    L. Nannini, A. Balayn, A. L. Smith, Explainability in ai policies: A critical review of communications, reports, regulations, and standards in the eu, us, and uk, in: Proceedings of the 2023 ACM conference on fairness, accountability, and transparency, 2023, pp. 1198–1212

  6. [6]

    J. Chun, C. S. de Witt, K. Elkins, Comparative global ai regulation: Policy perspectives from the eu, china, and the us, arXiv preprint arXiv:2410.21279 (2024)

  7. [7]

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755

  8. [8]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: T. Linzen, G. Chrupała, A. Alishahi (Eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belg...

  9. [9]

    O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology objects in context (roco): a multimodal image dataset, in: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3, Springer, 2018, pp. 180–189

  10. [10]

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433

  11. [11]

    R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (2017) 32–73

  12. [12]

    A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, R. G. Mark, Mimic-iii, a freely accessible critical care database, Scientific data 3 (1) (2016) 1–9

  13. [13]

    D. Saxena, J. Cao, Generative adversarial networks (gans) challenges, solutions, and future directions, ACM Computing Surveys (CSUR) 54 (3) (2021) 1–42

  14. [14]

    L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. Santamaría, M. A. Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: concepts, cnn architectures, challenges, applications, future directions, Journal of big Data 8 (2021) 1–74

  15. [15]

    Y. Liu, P. Li, X. Hu, Combining context-relevant features with multi-stage attention network for short text classification, Computer Speech & Language 71 (2022) 101268

  16. [16]

    D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014)

  17. [17]

    A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017)

  18. [18]

    A. Dosovitskiy, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

  19. [19]

    Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, H. Hu, Video swin transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211

  20. [20]

    P. Xu, X. Zhu, D. A. Clifton, Multimodal learning with transformers: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (10) (2023) 12113–12132

  21. [21]

    S. S. Sengar, A. B. Hasan, S. Kumar, F. Carroll, Generative artificial intelligence: A systematic review and applications, arXiv preprint arXiv:2405.11029 (2024)

  22. [22]

    S. Abnar, W. Zuidema, Quantifying attention flow in transformers, arXiv preprint arXiv:2005.00928 (2020)

  23. [23]

    Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, D. Zhu, AttCAT: Explaining Transformers via Attentive Class Activation Tokens

  24. [24]

    L. Parcalabescu, A. Frank, Mm-shap: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks, arXiv preprint arXiv:2212.08158 (2022)

  25. [25]

    N. Rodis, C. Sardianos, P. Radoglou-Grammatikis, P. Sarigiannidis, I. Varlamis, G. T. Papadopoulos, Multimodal explainable artificial intelligence: A comprehensive review of methodological advances and future research directions, IEEE Access (2024)

  26. [26]

    F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608 (2017)

  27. [27]

    L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, in: 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), IEEE, 2018, pp. 80–89

  28. [28]

    P. Fantozzi, M. Naldi, The explainability of transformers: Current status and directions, Computers 13 (4) (2024) 92

  29. [29]

    S. Mohseni, N. Zarei, E. D. Ragan, A multidisciplinary survey and framework for design and evaluation of explainable ai systems, ACM Transactions on Interactive Intelligent Systems (TiiS) 11 (3-4) (2021) 1–45

  30. [30]

    B. Kitchenham, Procedures for performing systematic reviews, Keele, UK, Keele University 33 (2004) (2004) 1–26

  31. [31]

    D. Moher, A. Liberati, J. Tetzlaff, D. G. Altman, Preferred reporting items for systematic reviews and meta-analyses: the prisma statement, Bmj 339 (2009)

  32. [32]

    S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in scientific writing: a friend or a foe?, Reproductive BioMedicine Online 47 (1) (2023) 3–9

  33. [33]

    J. Dagdelen, A. Dunn, S. Lee, N. Walker, A. S. Rosen, G. Ceder, K. A. Persson, A. Jain, Structured information extraction from scientific text with large language models, Nature Communications 15 (1) (2024) 1418

  34. [34]

    K. R. Felizardo, M. S. Lima, A. Deizepe, T. U. Conte, I. Steinmacher, Chatgpt application in systematic literature reviews in software engineering: an evaluation of its accuracy to support the selection activity, in: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2024, pp. 25–36

  35. [35]

    A. Huotala, M. Kuutila, P. Ralph, M. Mäntylä, The promise and challenges of using llms to accelerate the screening process of systematic reviews, in: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 2024, pp. 262–271

  36. [36]

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024)

  37. [37]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837

  38. [38]

    C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th international conference on evaluation and assessment in software engineering, 2014, pp. 1–10

  39. [39]

    H. Chefer, S. Gur, L. Wolf, Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Montreal, QC, Canada, 2021, pp. 387–396. doi:10.1109/ICCV48922.2021.00045. URL https://ieeexplore.ieee.org/document/9710570/

  40. [40]

    M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. Van Keulen, C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai, ACM Computing Surveys 55 (13s) (2023) 1–42

  41. [41]

    Y. Yang, L. Jiao, F. Liu, X. Liu, L. Li, P. Chen, S. Yang, An Explainable Spatial–Frequency Multiscale Transformer for Remote Sensing Scene Classification, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–15. doi:10.1109/TGRS.2023.3265361. URL https://ieeexplore.ieee.org/document/10097579/

  42. [42]

    Y. Huang, A. Jia, X. Zhang, J. Zhang, Generic Attention-model Explainability by Weighted Relevance Accumulation, in: ACM Multimedia Asia 2023, ACM, Tainan Taiwan, 2023, pp. 1–7. doi:10.1145/3595916.3626437. URL https://dl.acm.org/doi/10.1145/3595916.3626437

  43. [43]

    Y. Guo, F. Cai, H. Chen, C. Chen, X. Zhang, M. Zhang, An Explainable Recommendation Method based on Diffusion Model, in: 2023 9th International Conference on Big Data and Information Analytics (BigDIA), IEEE, Haikou, China, 2023, pp. 802–806. doi:10.1109/BigDIA60676.2023.10429319. URL https://ieeexplore.ieee.org/document/10429319/

  44. [44]

    Z. Liang, Y. Zhao, M. Surdeanu, Using the Hammer only on Nails: A Hybrid Method for Representation-Based Evidence Retrieval for Question Answering, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, Vol. 12656, Springer International Publishing, Cham, 2021, pp. 327–341, series Title: Le...

  45. [45]

    H. Wang, Y. Gao, Y. Bai, M. Lapata, H. Huang, Exploring Explainable Selection to Control Abstractive Summarization, Proceedings of the AAAI Conference on Artificial Intelligence 35 (15) (2021) 13933–13941. doi:10.1609/aaai.v35i15.17641. URL https://ojs.aaai.org/index.php/AAAI/article/view/17641

  46. [46]

    I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, J. Weill, N. Koenigstein, Interpreting BERT-based Text Similarity via Activation and Saliency Maps, arXiv:2208.06612 [cs] (Aug. 2022). doi:10.48550/arXiv.2208.06612. URL http://arxiv.org/abs/2208.06612

  47. [47]

    J. Ferrando, M. R. Costa-jussà, Attention Weights in Transformer NMT Fail Aligning Words Between Sequences but Largely Explain Model Predictions, arXiv:2109.05853 [cs] (Sep. 2021). doi:10.48550/arXiv.2109.05853. URL http://arxiv.org/abs/2109.05853

  48. [48]

    K. Zhang, L. Li, Explainable multimodal trajectory prediction using attention models, Transportation Research Part C: Emerging Technologies 143 (2022) 103829. doi:10.1016/j.trc.2022.103829. URL https://linkinghub.elsevier.com/retrieve/pii/S0968090X22002509

  49. [49]

    M. Treviso, N. M. Guerreiro, R. Rei, A. F. T. Martins, IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task, in: Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 133–145. doi:10.18653/v1/2021.eval4nlp-1.14. URL https:...

  50. [50]

    S. Wang, Q. Zeng, W. Ni, C. Cheng, Y. Wang, ODP-Transformer: Interpretation of pest classification results using image caption generation techniques, Computers and Electronics in Agriculture 209 (2023) 107863. doi:10.1016/j.compag.2023.107863. URL https://linkinghub.elsevier.com/retrieve/pii/S016816992300251X

  51. [51]

    J. Sun, S. Wang, J. Zhang, C. Zong, Neural Encoding and Decoding With Dis- tributed Sentence Representations, IEEE Transactions on Neural Networks and Learning Systems 32 (2) (2021) 589–603.doi:10.1109/TNNLS.2020.3027595. URLhttps://ieeexplore.ieee.org/document/9223750/

  52. [52]

    URLhttps://linkinghub.elsevier.com/retrieve/pii/S0031320323003679 50

    L.Yu, W.Xiang, J.Fang, Y.-P.P.Chen, L.Chi, eX-ViT:ANovelexplainablevision transformer for weakly supervised semantic segmentation, Pattern Recognition 142 (2023) 109666.doi:10.1016/j.patcog.2023.109666. URLhttps://linkinghub.elsevier.com/retrieve/pii/S0031320323003679 50

  53. [53]

    Parelli, D

    M. Parelli, D. Mallis, M. Diomataris, V. Pitsikalis, Interpretable Visual Question Answering via Reasoning Supervision, arXiv:2309.03726 [cs] (Sep. 2023).doi: 10.48550/arXiv.2309.03726. URLhttp://arxiv.org/abs/2309.03726

  54. [54]

    Aflalo, M

    E. Aflalo, M. Du, S.-Y. Tseng, Y. Liu, C. Wu, N. Duan, V. Lal, VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, arXiv:2203.17247 [cs] (Aug. 2022).doi:10.48550/arXiv.2203.17247. URLhttp://arxiv.org/abs/2203.17247

  55. [55]

    S. Katz, Y. Belinkov, VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers, arXiv:2305.13417 [cs] (Nov. 2023).doi:10.48550/arXiv. 2305.13417. URLhttp://arxiv.org/abs/2305.13417

  56. [56]

    X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y.Choi, J.Gao, Oscar: Object-SemanticsAlignedPre-trainingforVision-Language Tasks, arXiv:2004.06165 [cs] (Jul. 2020).doi:10.48550/arXiv.2004.06165. URLhttp://arxiv.org/abs/2004.06165

  57. [57]

    R. K. Kandukuri, J. Achterhold, M. Moeller, J. Stueckler, Physical Represen- tation Learning and Parameter Identification from Video Using Differentiable Physics, International Journal of Computer Vision 130 (1) (2022) 3–16.doi: 10.1007/s11263-021-01493-5. URLhttps://link.springer.com/10.1007/s11263-021-01493-5

  58. [58]

    W. Sun, C. Wang, H. Wu, Y. Miao, H. Zhu, W. Guo, J. Li, DFYOLOv5m- M2transformer: Interpretation of vegetable disease recognition results using image dense captioning techniques, Computers and Electronics in Agriculture 215 (2023) 108460.doi:10.1016/j.compag.2023.108460. URLhttps://linkinghub.elsevier.com/retrieve/pii/S0168169923008487

  59. [59]

    M. Rigotti, C. Miksovic, I. Giurgiu, T. Gschwind, P. Scotton, Attention-Based Interpretability with Concept Transformers (2022)

  60. [60]

    Y. Heo, S. Kang, J. Seo, Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System, Sensors 23 (18) (2023) 7875. doi:10.3390/s23187875. URL https://www.mdpi.com/1424-8220/23/18/7875

  61. [61]

    J. Dong, S. Chen, M. Miralinaghi, T. Chen, P. Li, S. Labi, Why did the AI make that decision? Towards an explainable artificial intelligence (XAI) for autonomous driving systems, Transportation Research Part C: Emerging Technologies 156 (2023) 104358. doi:10.1016/j.trc.2023.104358. URL https://linkinghub.elsevier.com/retrieve/pii/S0968090X23003480

  62. [62]

    A. H. Mohammadkhani, C. Tantithamthavorn, H. Hemmati, Explaining Transformer-based Code Models: What Do They Learn? When They Do Not Work?, in: 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM), IEEE, Bogotá, Colombia, 2023, pp. 96–106. doi:10.1109/SCAM59687.2023.00020. URL https://ieeexplore.ieee.org/document...

  63. [63]

    R. Buoy, M. Iwamura, S. Srun, K. Kise, Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition, Journal of Imaging 9 (11) (2023) 248. doi:10.3390/jimaging9110248. URL https://www.mdpi.com/2313-433X/9/11/248

  64. [64]

    M. Z. Boito, A. Villavicencio, L. Besacier, Investigating alignment interpretability for low-resource NMT, Machine Translation 34 (4) (2020) 305–323. doi:10.1007/s10590-020-09254-w. URL http://link.springer.com/10.1007/s10590-020-09254-w

  65. [65]

    T. Chen, S. Liu, Z. Chen, W. Hu, D. Chen, Y. Wang, Q. Lyu, C. X. Le, W. Wang, Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks, Advances in Artificial Intelligence and Machine Learning 03 (03) (2023) 1369–1388. doi:10.54364/AAIML.2023.1181. URL https://www.oajaiml.com/uploads/archivepdf/50081181.pdf

  66. [66]

    N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, S. Bengio, An online sequence-to-sequence model using partial conditioning, Advances in Neural Information Processing Systems 29 (2016)

  67. [67]

    R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, N. Jaitly, An analysis of "attention" in sequence-to-sequence models, in: Interspeech, 2017, pp. 3702–3706

  68. [68]

    Y. Meng, W. Speier, M. K. Ong, C. W. Arnold, Bidirectional Representation Learning From Transformers Using Multimodal Electronic Health Record Data to Predict Depression, IEEE Journal of Biomedical and Health Informatics 25 (8) (2021) 3121–3129. doi:10.1109/JBHI.2021.3063721. URL https://ieeexplore.ieee.org/document/9369833/

  69. [69]

    D. Lee, C. H. Suh, J. Kim, W. Jung, C. Park, K.-H. Jung, S. T. Kong, W. H. Shim, H. Heo, S. J. Kim, Augmenting Magnetic Resonance Imaging with Tabular Features for Enhanced and Interpretable Medial Temporal Lobe Atrophy Prediction, in: A. Abdulkadir, D. R. Bathula, N. C. Dvornek, M. Habes, S. M. Kia, V. Kumar, T. Wolfers (Eds.), Machine Learning in Cl...

  70. [70]

    C. C. Ukwuoma, Z. Qin, M. Belal Bin Heyat, F. Akhtar, O. Bamisile, A. Y. Muaad, D. Addo, M. A. Al-antari, A hybrid explainable ensemble transformer encoder for pneumonia identification from chest X-ray images, Journal of Advanced Research 48 (2023) 191–211. doi:10.1016/j.jare.2022.08.021. URL https://linkinghub.elsevier.com/retrieve/pii/S2090123222002028

  71. [71]

    T. Chiewhawan, P. Vateekul, Explainable Deep Learning for Thai Stock Market Prediction Using Textual Representation and Technical Indicators, in: Proceedings of the 8th International Conference on Computer and Communications Management, ACM, Singapore, 2020, pp. 19–23. doi:10.1145/3411174.3411191.

  72. [72]

    B. Wu, L. Wang, Y.-R. Zeng, Interpretable tourism demand forecasting with temporal fusion transformers amid COVID-19, Applied Intelligence 53 (11) (2023) 14493–14514. doi:10.1007/s10489-022-04254-0. URL https://link.springer.com/10.1007/s10489-022-04254-0

  73. [73]

    D. Wang, W. Li, X. Dong, H. Li, L. Hu, TFRegNCI: Interpretable Noncovalent Interaction Correction Multimodal Based on Transformer Encoder Fusion, Journal of Chemical Information and Modeling 63 (3) (2023) 782–793. doi:10.1021/acs.jcim.2c01283. URL https://pubs.acs.org/doi/10.1021/acs.jcim.2c01283

  74. [74]

    U. Naseem, M. Khushi, J. Kim, Vision-Language Transformer for Interpretable Pathology Visual Question Answering, IEEE Journal of Biomedical and Health Informatics 27 (4) (2023) 1681–1690. doi:10.1109/JBHI.2022.3163751. URL https://ieeexplore.ieee.org/document/9745795/

  75. [75]

    S. Xu, W. Zhang, F. Zhang, Multi-Granular BERT: An Interpretable Model Applicable to Internet-of-Thing devices, in: 2020 IEEE International Conference on Energy Internet (ICEI), IEEE, Sydney, NSW, Australia, 2020, pp. 134–139. doi:10.1109/ICEI49372.2020.00032. URL https://ieeexplore.ieee.org/document/9270262/

  76. [76]

    B. Janssens, L. Schetgen, M. Bogaert, M. Meire, D. Van Den Poel, 360 Degrees rumor detection: When explanations got some explaining to do, European Journal of Operational Research 317 (2) (2024) 366–381. doi:10.1016/j.ejor.2023.06.024. URL https://linkinghub.elsevier.com/retrieve/pii/S0377221723004769

  77. [77]

    P. Ding, Y. Wang, X. Zhang, X. Gao, G. Liu, B. Yu, DeepSTF: predicting transcription factor binding sites by interpretable deep neural networks combining sequence and shape, Briefings in Bioinformatics 24 (4) (2023) bbad231. doi:10.1093/bib/bbad231. URL https://academic.oup.com/bib/article/doi/10.1093/bib/bbad231/7199560

  78. [78]

    M. Feucht, Z. Wu, S. Althammer, V. Tresp, Description-based Label Attention Classifier for Explainable ICD-9 Classification, arXiv:2109.12026 [cs] (Sep. 2021). doi:10.48550/arXiv.2109.12026. URL http://arxiv.org/abs/2109.12026

  79. [79]

    P. Kumar, V. Kaushik, B. Raman, Towards the Explainability of Multimodal Speech Emotion Recognition, in: Interspeech 2021, ISCA, 2021, pp. 1748–1752. doi:10.21437/Interspeech.2021-1718. URL https://www.isca-archive.org/interspeech_2021/kumar21d_interspeech.html

  80. [80]

    F. Ullah, A. Alsirhani, M. M. Alshahrani, A. Alomari, H. Naeem, S. A. Shah, Explainable Malware Detection System Using Transformers-Based Transfer Learning and Multi-Model Visual Representation, Sensors 22 (18) (2022) 6766. doi:10.3390/s22186766.

Showing first 80 references.