Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
Pith reviewed 2026-05-19 00:30 UTC · model grok-4.3
The pith
Evaluation methods for XAI in multimodal attention models are inconsistent and overlook modality-specific factors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Studies predominantly address vision-language models and employ attention-based explanation algorithms, but these approaches fall short of capturing the full range of modality interactions because of architectural heterogeneity across domains. The authors respond with recommendations for rigorous, transparent, and standardized evaluation and reporting practices to advance more interpretable and accountable multimodal AI.
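To make "attention-based explanation" concrete, the following is a minimal sketch of one raw-attention summary: the share of attention mass that crosses the text/image boundary in a joint attention matrix. It is an illustration only, not a method taken from the reviewed paper. The function name `cross_modal_share`, the token layout (text tokens first, image tokens after), and the toy matrix are all assumptions.

```python
def cross_modal_share(attn, n_text):
    """Return the fraction of attention mass that crosses modalities.

    attn   : square matrix (list of lists) of attention weights over a
             concatenated sequence; hypothetical layout with the first
             n_text tokens being text and the remainder image patches.
    n_text : number of text tokens at the start of the sequence.
    """
    crossing = 0.0
    total = 0.0
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            total += weight
            # A weight "crosses" when query and key belong to different modalities.
            if (i < n_text) != (j < n_text):
                crossing += weight
    return crossing / total if total else 0.0


# Toy 4-token example (2 text tokens, 2 image tokens); values are made up.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.00, 0.50, 0.50],
]
share = cross_modal_share(attn, n_text=2)
print(share)
```

A single scalar like this says nothing about which cross-modal interactions drive a prediction, which is in line with the limitation the core claim describes.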
What carries the argument
Multi-dimensional analysis of the literature across model architecture, modalities, explanation algorithms, and evaluation methodologies.
If this is right
- Standardized evaluation protocols would increase consistency when comparing XAI techniques across multimodal tasks.
- Explanation algorithms would need to be redesigned to capture interactions between modalities more completely.
- Evaluations would incorporate modality-specific cognitive and contextual factors as a required component.
- Transparent reporting standards would improve accountability in multimodal AI development.
Where Pith is reading between the lines
- Inconsistent evaluations could delay safe adoption of multimodal models in domains that require high trust, such as medical imaging or autonomous navigation.
- The observed emphasis on vision-language pairs leaves explainability in other combinations, like audio-text or sensor fusion, relatively unexamined.
- Following the recommendations could enable direct benchmarking of different XAI approaches and speed cumulative progress in the area.
Load-bearing premise
The literature search from January 2020 to early 2024 and the chosen analysis dimensions capture a representative and unbiased sample of research in the field.
What would settle it
A new survey that identifies multiple studies sharing the same consistent, robust evaluation protocol that explicitly incorporates modality-specific cognitive and contextual factors would challenge the claim.
Original abstract
Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible mulitmodal AI systems, with explainability at their core.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a systematic literature review of explainability research on multimodal attention-based models, covering publications from January 2020 to early 2024. It analyzes the selected works along the dimensions of model architecture, involved modalities, explanation algorithms, and evaluation methodologies. The central finding is that evaluation methods in this area are largely non-systematic, lacking consistency, robustness, and attention to modality-specific cognitive or contextual factors; the paper concludes with a set of recommendations for more rigorous and standardized practices.
Significance. If the sampled literature accurately reflects the field, the review usefully documents gaps in evaluation rigor for multimodal XAI and supplies concrete recommendations that could help standardize future work. The multi-dimensional framing (architecture, modalities, algorithms, evaluation) is a constructive organizing device for the synthesis.
major comments (1)
- [Section 2] Section 2 (Literature Search and Selection Criteria): The description of the search strategy, databases, exact Boolean strings, and inclusion/exclusion criteria is not detailed enough to allow an independent assessment of coverage. This directly affects the load-bearing claim that evaluation methods are 'largely non-systematic' across multimodal XAI, because an under-sampling of non-attention-based or non-vision-language work (e.g., audio-sensor or time-series multimodal models) could artifactually produce the observed pattern.
minor comments (2)
- [Abstract] Abstract: 'mulitmodal' is a typographical error and should read 'multimodal'.
- Throughout the results sections, some tables summarizing evaluation metrics would benefit from an additional column or footnote clarifying whether the reported metrics are quantitative, qualitative, or human-subject based, to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our systematic literature review. We address the major comment point by point below, providing the strongest honest defense of the manuscript while agreeing where revisions are warranted to improve transparency and reproducibility.
- Referee: [Section 2] Section 2 (Literature Search and Selection Criteria): The description of the search strategy, databases, exact Boolean strings, and inclusion/exclusion criteria is not detailed enough to allow an independent assessment of coverage. This directly affects the load-bearing claim that evaluation methods are 'largely non-systematic' across multimodal XAI, because an under-sampling of non-attention-based or non-vision-language work (e.g., audio-sensor or time-series multimodal models) could artifactually produce the observed pattern.
  Authors: We thank the referee for this observation. We agree that greater specificity in Section 2 is needed to support independent verification of our search coverage and to strengthen the foundation for our central claim. In the revised manuscript we will expand this section to include: the complete list of databases (IEEE Xplore, ACM Digital Library, Scopus, Web of Science, arXiv, and Google Scholar), the exact Boolean search strings employed (combinations of terms such as 'multimodal attention' AND ('explainability' OR 'XAI' OR 'interpretability') AND modality-specific keywords), the full inclusion/exclusion criteria with justifications, the number of records at each screening stage, and a PRISMA flow diagram. Regarding potential under-sampling, our protocol was deliberately scoped to attention-based multimodal models with explainability components; while vision-language tasks dominate the retrieved literature, we did include qualifying studies involving audio, sensor, and time-series modalities. We will add an explicit limitations subsection discussing the observed distribution across modalities and its implications for generalizing the evaluation-method findings. These changes will make the sampling transparent and allow readers to assess whether the reported patterns in evaluation rigor are representative of the sampled corpus.
  Revision: yes
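The Boolean search string pattern mentioned in the rebuttal can be sketched as follows. This is a minimal illustration, assuming a generic AND-of-OR-groups syntax; the helper names `or_group` and `build_query` are hypothetical, and the modality keyword list is a placeholder, since the rebuttal does not give the actual modality-specific keywords. Real databases (IEEE Xplore, Scopus, and so on) each have their own query syntax, so the string would need adapting per source.

```python
def or_group(terms):
    """Quote each term and join the group with OR (parenthesised if needed)."""
    quoted = ['"{}"'.format(t) for t in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def build_query(groups):
    """Join OR-groups with AND to form one Boolean search string."""
    return " AND ".join(or_group(g) for g in groups)


core_terms = ["multimodal attention"]                      # from the rebuttal
xai_terms = ["explainability", "XAI", "interpretability"]  # from the rebuttal
modality_terms = ["audio", "sensor", "time-series"]        # hypothetical placeholders

query = build_query([core_terms, xai_terms, modality_terms])
print(query)
```

Keeping the term groups as data rather than a hand-written string makes it straightforward to report the exact strings used per database, which is what the referee asks for.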
Circularity Check
No circularity: systematic literature review without derivations or self-referential predictions
This paper is a systematic literature review that synthesizes published work on explainability in multimodal attention-based models from January 2020 to early 2024. It examines dimensions such as architecture, modalities, explanation algorithms, and evaluation methodologies but contains no equations, fitted parameters, predictions, or derivation chains. The central claim about non-systematic evaluation methods follows directly from the authors' analysis of sampled papers rather than reducing to any self-definition, fitted input, or self-citation load-bearing step. The work is self-contained as an external synthesis with no internal circular reasoning.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The chosen time period and focus on attention-based multimodal models capture the relevant body of explainability research.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities... evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
Table 6: Architecture Variants of Multimodal Attention-based Models... Early Summation, Early Concatenation, Hierarchical Multi-to-One, Single Cross-Attention Branch, Multi-Cross Attention
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.