
arxiv: 2605.06058 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.CV

Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

Pith reviewed 2026-05-08 14:05 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords Document Visual Question Answering · Self-explainable models · Chain-of-explanation · Grounded reasoning · Vision-language models · Explainable AI · PFL-DocVQA · ANLS metric

The pith

CoExVQA forces DocVQA models to localize answer regions before decoding to create self-explainable predictions with improved accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoExVQA, a framework for document visual question answering that follows a chain-of-explanation process. The model first finds question-relevant evidence, then localizes the specific answer region on the page, and finally generates the answer only from that region. This design makes the reasoning process transparent and verifiable by allowing inspection of each step in the chain. Readers would care because current DocVQA models are black boxes that do not show how they use visual evidence, limiting trust in their outputs for practical uses like information extraction from documents. Results indicate that this grounded approach reaches state-of-the-art performance for explainable systems on the PFL-DocVQA benchmark.

Core claim

CoExVQA implements a grounded reasoning process by first identifying question-relevant evidence, then explicitly localizing the answer region, and finally decoding the answer exclusively from the grounded region. This chain-of-explanation design enables direct inspection and verification of the reasoning process across modalities, achieving state-of-the-art explainable DocVQA performance on PFL-DocVQA with a 12% improvement in ANLS over current explainable baselines.
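
ANLS (Average Normalized Levenshtein Similarity) is the metric behind the 12% figure. As a rough illustration, the sketch below follows the commonly used definition: each prediction scores 1 minus its length-normalized edit distance to the closest ground-truth answer, and the score is set to zero once the normalized distance reaches a threshold of 0.5. The function names, the lowercasing and whitespace stripping, and the 0.5 default are assumptions made here, not details confirmed by the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(prediction: str, truth: str, tau: float = 0.5) -> float:
    """1 - normalized edit distance, zeroed when the distance reaches tau."""
    # Assumption: case-insensitive comparison on stripped strings.
    pred, gold = prediction.strip().lower(), truth.strip().lower()
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    distance = levenshtein(pred, gold) / longest
    return 1.0 - distance if distance < tau else 0.0


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         tau: float = 0.5) -> float:
    """Average, over questions, of the best similarity to any accepted answer."""
    scores = [max(similarity(pred, gold, tau) for gold in golds)
              for pred, golds in zip(predictions, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # The near miss reported in the Figure 17 caption.
    print(round(anls(["10,646"], [["10,596"]]), 3))  # 0.667
```

For the Figure 17 case, where the model decodes “10,646” against a ground truth of “10,596”, two substituted characters out of six give a similarity of about 0.667 rather than zero, so ANLS is more forgiving of small decoding slips than exact match.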

What carries the argument

The chain-of-explanation design, which sequences evidence identification, answer localization, and grounded answer decoding to enforce transparency.
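
The three-stage sequence can be sketched as a small driver in which the decoder only ever receives the cropped region. This is a minimal sketch under stated assumptions: the `Explanation` container, the `crop` helper, the callable signatures, and the toy stand-ins are all hypothetical, and whether the real localizer consumes the heatmap directly is not stated in the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Grid = List[List[float]]         # stand-in for a page: rows of patch values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


@dataclass
class Explanation:
    heatmap: Grid  # step 1: question-relevant evidence
    box: Box       # step 2: localized answer region
    answer: str    # step 3: answer decoded from the region only


def crop(page: Grid, box: Box) -> Grid:
    """Return only the part of the page that lies inside the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]


def chain_of_explanation(
    page: Grid,
    question: str,
    find_evidence: Callable[[Grid, str], Grid],
    localize: Callable[[Grid, str, Grid], Box],
    decode: Callable[[Grid, str], str],
) -> Explanation:
    heatmap = find_evidence(page, question)
    box = localize(page, question, heatmap)
    # The decoder never sees the full page, only the grounded region.
    answer = decode(crop(page, box), question)
    return Explanation(heatmap, box, answer)


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_evidence(page: Grid, question: str) -> Grid:
    return [row[:] for row in page]


def toy_localize(page: Grid, question: str, heatmap: Grid) -> Box:
    # Tightest box around the single highest-scoring cell.
    _, r, c = max((v, r, c) for r, row in enumerate(heatmap)
                  for c, v in enumerate(row))
    return (r, c, r + 1, c + 1)


def toy_decode(region: Grid, question: str) -> str:
    return ",".join(str(v) for row in region for v in row)


page = [[0.1, 0.2, 0.0], [0.0, 0.9, 0.3]]
result = chain_of_explanation(page, "q", toy_evidence, toy_localize, toy_decode)
```

Because every intermediate output is returned, each link in the chain (heatmap, box, answer) can be inspected separately, which is the property the design relies on.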

If this is right

  • Predictions can be verified by checking the identified evidence and localized region against the question.
  • Accuracy improves because decoding is restricted to relevant visual evidence rather than the entire page.
  • The framework provides transparent and verifiable predictions suitable for applications requiring accountability.
  • It disentangles evidence identification from answer generation, reducing black-box behavior in vision-language models.
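
One way to make the first point concrete is to compare a predicted answer box against a reference box. The sketch below uses intersection-over-union (IoU) and a simple containment test; both are standard geometric checks chosen here for illustration, and the box format and function names are assumptions rather than the paper's evaluation protocol.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    intersection = (max(a[0], b[0]), max(a[1], b[1]),
                    min(a[2], b[2]), min(a[3], b[3]))
    inter_area = area(intersection)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


def contains(outer: Box, inner: Box) -> bool:
    """True when the inner box lies fully inside the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


predicted = (0.0, 0.0, 2.0, 2.0)
reference = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, reference)  # 1 / 7
```

The containment test mirrors the "answer fully inside the predicted box" criterion used for the Correct label in the Figure 6 audit, while IoU additionally penalizes boxes that are much larger than the answer.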

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might generalize to other multimodal tasks where grounding is important, such as visual question answering on images or videos.
  • By constraining the model to visual evidence, it could help mitigate issues like hallucination in language model outputs.
  • Testing the approach on additional DocVQA datasets would help establish its robustness beyond PFL-DocVQA.

Load-bearing premise

That requiring the model to localize the answer region before decoding will increase both accuracy and explainability without causing loss of necessary context or errors from incorrect localizations.

What would settle it

A direct counterexample would be if experiments on PFL-DocVQA showed that the chain-of-explanation model performs worse than or equal to non-grounded baselines in ANLS score, or if human evaluators find the localized regions do not match the actual answer locations used by the model.

Figures

Figures reproduced from arXiv: 2605.06058 by Adrian Duric, Ali Ramezani-Kebrya, Changkyu Choi, Kjetil Indrehus.

Figure 1: Overview of the CoExVQA prediction pipeline.
Figure 2: Re-encoding variants. The two re-encoding strategies used to refine the model’s focus on the predicted answer region. … scores obtained from the late-interaction similarity between the question-tokens and image-patches, spatially aligned to the backbone’s patch grid (512 patches by default). Following DocVXQA, we treat the priors $H_Q$ as weak spatial supervision for learning the question–evidence alignment hea…
Figure 3: CoExVQA Example Prediction. Question given to the model: “What is the name of the University?”. 3a shows the original document given to the model. 3b shows the question heatmap overlay predicted by the model over the document. The model highlights “Vanderbilt” with high correlation. 3c shows how the predicted answer location is correctly aligned with the ground truth answer location. From the answer reg…
Figure 4: User evaluation results. (a) Identification and answer recovery rates for correct and incorrect model predictions. (b) Perceived faithfulness, trust, and usability (7-point Likert). † denotes a negatively coded statement. Explanations enable participants to distinguish correct from incorrect predictions and recover the model’s answer, while perceived faithfulness and usability score favourably. 5 Conclusio…
Figure 5: Reason distribution by OCR engine. Distribution of selected match reasons for (Left) PaddleOCR and (Right) Amazon Textract OCR on train and validation splits. Each table shows the distribution of the used answer-match methods against each other.
    Reason            Train          Validation
    exact_digits      147 (18.1%)    21 (18.6%)
    exact_norm        431 (53.1%)    67 (59.3%)
    fuzzy_norm        101 (12.4%)    15 (13.3%)
    substring_digits  19 (2.3%)      1 (0.9%)
    …
Figure 6: OCR utilization and audit. (Left) Distribution of which OCR engine produced an acceptable match for answer localization in the train and validation splits. (Right) Quantitative outcome of a manual visual audit of 200 randomly sampled validation examples, labelled as Correct (answer fully inside the predicted box), Partial (answer inside but box misses some tokens), or Incorrect (answer not covered or wrong …
Figure 7: Paddle examples where the red bounding box denotes the predicted location. Document information is appended as a header to the document (ID, Question, Answer, Selected method for answer location). (a) Correct location is in the box below. The text is missing connections, and would require manual annotation to locate. (b) Correct answer is “U.S.”. This information is only available on the stamp, around th…
Figure 8: Examples where the pipeline failed to locate the answer location. Document information is appended as a header to the document (ID, Question, Answer, Selected method for answer location). Effect of human annotation priors. In addition to the OCR audit, we manually annotated 300 DocVQA examples to measure the performance change when re-training on manually annotated examples. We annotate …
Figure 9: Question prior from ColSmol-500M: document (left), raw prior (middle), post-processed prior (right). Question Prior Evaluation. We evaluated the different generated outputs based on two key perspectives: (1) whether the question prior places mass on the ground-truth answer region, and (2) … [footnote 4: https://huggingface.co/vidore/colSmol-500M] [footnote 5: https://huggingface.co/vidore/colqwen2.5-v0.2] [footnote 6: https://huggingface.co/goog…]
Figure 10: Question prior from ColQwen2.5: document (left), raw prior (middle), post-processed prior (right).
Figure 11: Question prior from Pix2Struct cross-attention: document (left), raw prior (middle), post-processed prior (right). … how much irrelevant information the prior is able to suppress. A question-relevant heatmap should be selective and highlight only a portion of the page. We emphasise that pure question-answer overlap alone is not a complete metric for evaluating question relevancy. Question-relevant information ca…
Figure 12: End-to-end projector flow. Inputs E, Q (embeddings) enter the Projector, and the output is the predicted question heatmap $\hat{H}_Q$, or the predicted answer bounding box $\hat{b}_A$. The Fusion, Context Aggregation and Feed-forward Network (FFN) blocks keep the shape of the original embeddings E. The task-head reduces the prediction to an answer-localization bounding box, or keeps it at a per-patch level with the question…
Figure 13: Training plots of CoExVQA. Top-left: total training loss. Top-right: total validation loss. Bottom-left: training projector loss (localization/prior objectives). Bottom-right: training decoder loss (text generation objective). The vertical line marks the end of the decoder-loss warmup, after which the decoder loss is applied at full weight. Curves are shown up to the early-stopping epoch (43). F Backbone …
Figure 14: Three examples with different AR. The three models are from the training runs in Appendix E. Figure 14a shows a small AR, but one large enough to include the necessary context. Here, the DocVQA performance is lower. This might be due to text being cut off by patches (e.g., the number 7 may be cut off so that it visually resembles a 1 within the answer region). Figure 14b shows a medium AR, and a stronger perf…
Figure 15: Masked question-evidence heatmaps when masking is applied to patches that overlap the question-prior region. The masking probability denotes the fraction of overlapping patches that are removed (set to zero). (a) Masking probability = 10% (b) Masking probability = 50% (c) Masking probability = 90%
Figure 16: Masked question-evidence heatmaps when masking is applied to patches outside the question-prior region (Non-QP). The masking probability denotes the fraction of non-overlapping patches that are removed. I Qualitative Analysis of Predictions We show some examples of predictions from our fully trained CoExVQA model. On the left is the original document, in the middle is the question heatmap, and on the rig…
Figure 17: Question: “What is the Expenses for Publications for 1987?”. The model predicted from the answer region: “10,646”, and the correct answer was “10,596”. Inspecting the predicted answer region, one can confirm that the model found the correct answer region, but was not able to correctly decode the answer. This model variant had lower accuracy due to low AR ≈ 2.5, but provides high faithfulness and compact …
Figure 18: Question: “What is the name of the company mentioned at the top of the page?”. The model predicted the correct answer from the answer region: “Johnson & Johnson and subsidiaries”. The provided question heatmap seems more trivial at first glance, but it highlights part of “Johnson” and upper regions of the model. The predicted region is larger because it comes from the best model, with AR ≈ 19. The given context i…
Figure 19: Question: “What is the Net Pound Infeed?”. The model predicted the correct answer from the answer region: “893”. The provided question heatmap highlights the region in close proximity to the answer location. The predicted region is larger because it comes from the best model, with AR ≈ 19. The given context is enough for the decoder to correctly decode the answer. … region by either (i) decoding the most logical ans…
Figure 20: Examples where the model correctly localizes the answer and decodes the correct answer.
Figure 21: Examples where the model correctly localizes the answer and decodes the correct answer. (a) Question: “Which is the first exposure group on the plot?”, Predicted Text Answer: “MC” and Ground Truth Answer: “MC”. (b) Question: “What is Mr. McCoy’s date of birth?”, Predicted Text Answer: “March 22, 1921” and Ground Truth Answer: “March 22, 1921”. (c) Question: “What does AMA stand for?”, Predicted Text An…
Figure 22: Examples where the model correctly localizes the answer and decodes the correct answer.
Figure 23: Examples where the model partially includes the correct answer region. However, the model is still able to decode the correct text answer. Figure 23b cuts off the upper part of the signature, but the decoder is still able to recover the answer. (a) Question: “What is the name present in the letter drop?”, Predicted Text Answer: “Virginia Slims Superslims Consumer Testi” and Ground Truth Answer: “PHILIP …
Figure 24: Incorrect examples. In Figure 24a, the predicted answer location looks plausible, but overlaps the incorrect text span. The correct location is at the top of the page. Figures 24b and 24c both highlight parts of the document with little to no question-relevant information, which leads to incorrect decoding. Most notable is Figure 24c, which predicts a region that contains no readable text. Despite being wrong,…
Figure 25: User evaluation instructions (page 1).
Figure 26: User evaluation instructions (page 2).
Figure 27: User evaluation instructions (page 3).
Figure 28: User evaluation instructions (page 4).
Figure 29: Participant demographics (17 participants). Part 1: Answer Justification. Participants were shown 6 examples (4 correct, 2 incorrect predictions) and asked whether the explanation sufficiently justified the prediction, whether they believed the model predicted correctly, and rated their confidence and the quality of individual explanation components on 7-point Likert scales. This yields 102 total evaluations
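
The masking experiments described in the captions of Figures 15 and 16 remove a fraction of patches either inside or outside the question-prior region. The sketch below is one plausible reading, assuming patches are a flat list of values, the prior is a boolean mask of the same length, and exactly the stated fraction of candidate patches is zeroed; the function name, the fixed seed, and the exact-fraction sampling (rather than an independent coin flip per patch) are assumptions.

```python
import random
from typing import List, Sequence


def mask_patches(
    patches: Sequence[float],
    in_prior: Sequence[bool],
    probability: float,
    target_prior: bool = True,
    seed: int = 0,
) -> List[float]:
    """Zero a fraction of patches inside (or outside) the question prior.

    target_prior=True masks patches that overlap the prior (as in Figure 15);
    target_prior=False masks patches outside it (Non-QP, as in Figure 16).
    """
    candidates = [i for i, flag in enumerate(in_prior) if flag == target_prior]
    n_masked = round(probability * len(candidates))
    # Seeded sampling keeps the sketch reproducible.
    chosen = set(random.Random(seed).sample(candidates, n_masked))
    return [0.0 if i in chosen else value for i, value in enumerate(patches)]


patches = [1.0] * 10
in_prior = [True] * 4 + [False] * 6
masked_inside = mask_patches(patches, in_prior, probability=0.5)
masked_outside = mask_patches(patches, in_prior, probability=0.5,
                              target_prior=False)
```

Comparing model behaviour under the two settings indicates whether predictions depend on the patches the question prior marks as relevant or on the rest of the page.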
Original abstract

Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CoExVQA, a self-explainable DocVQA framework employing a chain-of-explanation design: first identifying question-relevant evidence, then explicitly localizing the answer region on the page, and finally decoding the answer exclusively from that grounded region. The central empirical claim is that this restriction to grounded evidence yields state-of-the-art performance among explainable DocVQA models on the PFL-DocVQA dataset, delivering a 12% ANLS improvement over current explainable baselines while enabling direct inspection and verification of the multimodal reasoning process.

Significance. If the empirical results and the localization-explainability link are substantiated, the work would meaningfully advance explainable document understanding by disentangling evidence localization from answer generation and making the reasoning chain inspectable. This addresses a recognized limitation of black-box vision-language models in DocVQA and could support higher-trust applications in document analysis. The design also offers a concrete mechanism for verifiable predictions, which is a strength relative to purely post-hoc explanation methods.

major comments (2)
  1. Abstract: the claim of a 12% ANLS improvement and SotA explainable performance is presented without any description of the experimental setup, baselines, ablations, or error analysis of the localization module. This is load-bearing for the central claim, as the skeptic correctly notes that no verification is given that the localization step is accurate enough to avoid discarding useful context or introducing errors that would negate the reported gain.
  2. Abstract: the weakest assumption—that forcing exclusive decoding from a single localized region simultaneously improves accuracy and genuine explainability—is not tested. No ablation removing the restriction, no breakdown of multi-region questions, and no analysis of localization error rates on PFL-DocVQA are provided, leaving open the possibility that the observed improvement stems from other unstated factors rather than the chain-of-explanation design.
minor comments (2)
  1. Abstract: the phrase 'transparent and verifiable predictions' should be accompanied by a concrete example or figure showing how a user would inspect the chain (evidence identification, bounding box, and answer) to make the benefit explicit.
  2. The manuscript would benefit from a dedicated section or diagram illustrating the three-stage pipeline with input/output examples from PFL-DocVQA.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and will make revisions to strengthen the manuscript, particularly by enhancing the abstract and adding requested analyses.

Point-by-point responses
  1. Referee: Abstract: the claim of a 12% ANLS improvement and SotA explainable performance is presented without any description of the experimental setup, baselines, ablations, or error analysis of the localization module. This is load-bearing for the central claim, as the skeptic correctly notes that no verification is given that the localization step is accurate enough to avoid discarding useful context or introducing errors that would negate the reported gain.

    Authors: We agree that the abstract would benefit from additional context to support the central empirical claim. In the revised manuscript, we will expand the abstract to briefly describe the PFL-DocVQA dataset, mention the explainable baselines compared against, and note that detailed ablations and localization error analysis will be added to the Experiments section to substantiate the reported gain. revision: yes

  2. Referee: Abstract: the weakest assumption—that forcing exclusive decoding from a single localized region simultaneously improves accuracy and genuine explainability—is not tested. No ablation removing the restriction, no breakdown of multi-region questions, and no analysis of localization error rates on PFL-DocVQA are provided, leaving open the possibility that the observed improvement stems from other unstated factors rather than the chain-of-explanation design.

    Authors: We acknowledge the importance of directly testing this core assumption. The manuscript does not currently include an ablation removing the exclusive decoding restriction. We will add this ablation in the revised version, along with a breakdown of multi-region questions and localization error rate analysis on PFL-DocVQA, to confirm that the improvements stem from the chain-of-explanation design. revision: yes

Circularity Check

0 steps flagged

No circularity: framework design and empirical gains are independent of inputs

Full rationale

The paper introduces CoExVQA as an architectural framework (localize then decode exclusively from grounded region) whose claimed ANLS improvement is presented as an empirical outcome on PFL-DocVQA rather than a quantity derived by construction from fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are shown that reduce the prediction or the performance delta to the model's own inputs. The chain-of-explanation is a procedural design choice whose correctness is left to external validation on the benchmark, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities. The framework appears to build on standard vision-language model components, but no further decomposition is possible from the given text.



Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages

  1. [1]

    Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence.Information Fusion, 99:101805, 2023

    Sajid Ali, Tamer Abuhmed, Shaker El-Sappagh, Khan Muhammad, Jose M Alonso-Moral, Roberto Confalonieri, Riccardo Guidotti, Javier Del Ser, Natalia Díaz-Rodríguez, and Francisco Herrera. Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence.Information Fusion, 99:101805, 2023

  2. [2]

    A survey of explainable artificial intelligence (xai) in financial time series forecasting.ACM Computing Surveys, 57(10):1–37, 2025

    Pierre-Daniel Arsenault, Shengrui Wang, and Jean-Marc Patenaude. A survey of explainable artificial intelligence (xai) in financial time series forecasting.ACM Computing Surveys, 57(10):1–37, 2025

  3. [3]

    Self-driving cars: A survey.Expert systems with applications, 165:113816, 2021

    Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M Paixao, Filipe Mutz, et al. Self-driving cars: A survey.Expert systems with applications, 165:113816, 2021

  4. [4]

    Survey on question answering over visually rich documents: Methods, challenges, and trends, 2025

    Camille Barboule, Benjamin Piwowarski, and Yoan Chabot. Survey on question answering over visually rich documents: Methods, challenges, and trends, 2025

  5. [5]

    Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K

    Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy- Chowdhury, and Srimat Chakradhar. Visual alignment of medical vision-language models for grounded radiology report generation, 2025

  6. [6]

    Ai in finance: challenges, techniques, and opportunities.ACM Computing Surveys (CSUR), 55(3):1–38, 2022

    Longbing Cao. Ai in finance: challenges, techniques, and opportunities.ACM Computing Surveys (CSUR), 55(3):1–38, 2022

  7. [7]

    Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023

  8. [8]

    Fouhey, Joyce Chai, and Shengyi Qian

    Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Jianing Yang, David F. Fouhey, Joyce Chai, and Shengyi Qian. Multi-object hallucination in vision language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 44393–44418. Curran Associates,...

  9. [9]

    DIB-X: Formulating explainability principles for a self-explainable model through information theoretic learning

    Changkyu Choi, Shujian Yu, Michael Kampffmeyer, Arnt-Børre Salberg, Nils Olav Handegard, and Robert Jenssen. DIB-X: Formulating explainability principles for a self-explainable model through information theoretic learning. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7170–7174. IEEE, 2024

  10. [10]

    Paddleocr 3.0 technical report, 2025

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025

  11. [11]

    Attention grounded enhancement for visual document retrieval, 2025

    Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu, Meiguang Jin, Junfeng Ma, and Keping Bi. Attention grounded enhancement for visual document retrieval, 2025

  12. [12]

    Pp-ocr: A practical ultra lightweight ocr system, 2020

    Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, and Haoshuang Wang. Pp-ocr: A practical ultra lightweight ocr system, 2020. 10

  13. [13]

    Colpali: Efficient document retrieval with vision language models, 2025

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models, 2025

  14. [14]

    Safe artificial intelligence in finance.Finance Research Letters, 56:104088, 2023

    Paolo Giudici and Emanuela Raffinetti. Safe artificial intelligence in finance.Finance Research Letters, 56:104088, 2023

  15. [15]

    What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation, 2025

    Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, and Carsten Eickhoff. What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation, 2025

  16. [16]

    Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues.Array, 10:100057, 2021

    Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed Shaharyar Khwaja. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues.Array, 10:100057, 2021

  17. [17]

    Deep residual learning for im- age recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016

  18. [18]

    Iterative answer prediction with pointer-augmented multimodal transformers for textvqa

    Ronghang Hu, Amanpreet Singh, Trevor Darrell, and Marcus Rohrbach. Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9992–10002, 2020

  19. [19]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, p...

  20. [20]

    Layoutlmv3: Pre-training for document ai with unified text and image masking

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. InProceedings of the 30th ACM international conference on multimedia, pages 4083–4091, 2022

  21. [21]

    Towards self-explainable document visual question answering through infor- mation theoretic learning

    Kjetil Indrehus. Towards self-explainable document visual question answering through infor- mation theoretic learning. Msc thesis, informatics: Programming and systems architecture, University of Oslo, 2026. Submitted

  22. [22]

    A survey on vision-language-action models for autonomous driving, 2025

    Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, and Lijun Sun. A survey on vision-language-action models for autonomous driving, 2025

  23. [23]

    Explainability and vision foundation models: A survey.Information Fusion, 122:103184, 2025

    Rémi Kazmierczak, Eloïse Berthier, Goran Frehse, and Gianni Franchi. Explainability and vision foundation models: A survey.Information Fusion, 122:103184, 2025

  24. [24]

    Colbert: Efficient and effective passage search via contextual- ized late interaction over bert

    Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextual- ized late interaction over bert. InProceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020

  25. [25]

    Ocr-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. InComputer Vision – ECCV 2022, pages 498–517, Cham, 2022. Springer Nature Switzerland

  26. [26]

    Vision-language model-based local interpretable model- agnostic explanations analysis for explainable in-vehicle controller area network intrusion detection.Sensors, 25(10), 2025

    Jaeseung Lee and Jehyeok Rew. Vision-language model-based local interpretable model- agnostic explanations analysis for explainable in-vehicle controller area network intrusion detection.Sensors, 25(10), 2025

  27. [27]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisensch- los, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. InInternational Confer- ence on Machine Learning, pages 18893–18912. PMLR, 2023

  28. [28]

    Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception

    Mengqi Lei, Siqi Li, Yihong Wu, and others. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception.arXiv preprint arXiv:2506.17733, 2025. 11

  29. [29]

    V . I. Levenshtein. Binary coodes capable of correcting deletions, insertions, and reversals. In Soviet physics-doklady, volume 10, 1966

  30. [30]

    Explainable artificial intelligence (xai) 2.0: A manifesto of open challenges and interdisciplinary research directions.Information Fusion, 106:102301, 2024

    Luca Longo, Mario Brcic, Federico Cabitza, Jaesik Choi, Roberto Confalonieri, Javier Del Ser, Riccardo Guidotti, Yoichi Hayashi, Francisco Herrera, Andreas Holzinger, et al. Explainable artificial intelligence (xai) 2.0: A manifesto of open challenges and interdisciplinary research directions.Information Fusion, 106:102301, 2024

  31. [31]

    A unified approach to interpreting model predictions

    Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  32. [32]

    Learning visual question answering by bootstrapping hard attention

    Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. Learning visual question answering by bootstrapping hard attention. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018

  33. [33]

    ColMate: Contrastive late interaction and masked text for multimodal document retrieval

    Ahmed Masry, Megh Thakkar, Patrice Bechard, Sathwik Tejaswi Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte Suresh, Srivatsava Daruru, Enamul Hoque, Spandana Gella, Torsten Scholak, and Sai Rajeswar. ColMate: Contrastive late interaction and masked text for multimodal document retrieval. In Saloni Potdar, Lina Rojas-Barahona, and Sebastien Mon...

  34. [34]

    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2200–2209, January 2021

  35. [35]

    Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Computers in Biology and Medicine, 156:106668, 2023

    Sajid Nazir, Diane M Dickson, and Muhammad Usman Akram. Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Computers in Biology and Medicine, 156:106668, 2023

  36. [36]

    Hoang T. N. Nguyen, Dong Nie, Taivanbat Badamdorj, Yujie Liu, Yingying Zhu, Jason Truong, and Li Cheng. Automated generation of accurate & fluent medical x-ray reports, 2021

  37. [37]

    Multimodal explanations: Justifying decisions and pointing to the evidence

    Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8779–8788, 2018

  38. [38]

    Kosmos-2: Grounding multimodal large language models to the world, 2023

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023

  39. [39]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  40. [40]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  41. [41]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  42. [42]

    "why should i trust you?" explaining the predictions of any classifier

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should i trust you?" explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016

  43. [43]

    Privacy-aware document visual question answering

    Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed Ali Souibgui, Kangsoo Jung, Joonas Jälkö, Vincent Poulain d’Andecy, Aurelie Joseph, Lei Kang, et al. Privacy-aware document visual question answering. In International Conference on Document Analysis and Recognition, 2024

  44. [44]

    Explainable ai (xai): A systematic meta-survey of current challenges and future opportunities. Knowledge-Based Systems, 263:110273, 2023

    Waddah Saeed and Christian Omlin. Explainable ai (xai): A systematic meta-survey of current challenges and future opportunities. Knowledge-Based Systems, 263:110273, 2023

  45. [45]

    Explainable ai for healthcare 5.0: Opportunities and challenges

    Deepti Saraswat, Pronaya Bhattacharya, Ashwin Verma, Vivek Kumar Prasad, Sudeep Tanwar, Gulshan Sharma, Pitshou N. Bokoro, and Ravi Sharma. Explainable ai for healthcare 5.0: Opportunities and challenges. IEEE Access, 10:84486–84517, 2022

  46. [46]

    Grad-cam: Visual explanations from deep networks via gradient-based localization

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017

  47. [47]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Proc...

  48. [48]

    Docvxqa: Context-aware visual explanations for document question answering

    Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, and Dimosthenis Karatzas. Docvxqa: Context-aware visual explanations for document question answering. In Forty-second International Conference on Machine Learning, 2025

  49. [49]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023

  50. [50]

    Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning, 2025

    Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning, 2025

  51. [51]

    On the faithfulness of vision transformer explanations

    Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, and Yan Yan. On the faithfulness of vision transformer explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10936–10945, 2024

  52. [52]

    Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding, 2021

    Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, and Furu Wei. Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding, 2021

  53. [53]

    Survey on explainable AI: From approaches, limitations and applications aspects. Human-Centric Intelligent Systems, 3:161–188, 2023

    Wenli Yang, Yuchen Wei, Hanyu Wei, Yanyu Chen, Guan Huang, Xiang Li, Renjie Li, Naimeng Yao, Xinyi Wang, Xiaotong Gu, Muhammad Bilal Amin, and Byeong Kang. Survey on explainable AI: From approaches, limitations and applications aspects. Human-Centric Intelligent Systems, 3:161–188, 2023

  54. [54]

    Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

    Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19187–19197, June 2023

  55. [55]

    Tap: Text-aware pre-training for text-vqa and text-caption

    Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8751–8761, 2021

  56. [56]

    Coe: Chain-of-explanation via automatic visual concept circuit description and polysemanticity quantification

    Wenlong Yu, Qilong Wang, Chuang Liu, Dong Li, and Qinghua Hu. Coe: Chain-of-explanation via automatic visual concept circuit description and polysemanticity quantification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4364–4374, June 2025

  57. [57]

    Locate then generate: Bridging vision and language with bounding box for scene-text vqa

    Yongxin Zhu, Zhen Liu, Yukang Liang, Xin Li, Hao Liu, Changcun Bao, and Linli Xu. Locate then generate: Bridging vision and language with bounding box for scene-text vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11479–11487, 2023

Appendix

The appendix contains the following supplementary material:

• Appendix A...

  1. Use Amazon Textract OCR to collect line texts with bounding boxes.

  2. For each OCR line t_i, compute norm(t_i) and dig(t_i) and compare them to the answer a. We select the first matching line according to the priority: (1) exact match on norm, (2) exact match on dig, (3) substring match on norm, (4) substring match on dig, (5) fuzzy match on norm (score ≥ τ_text), and (6) fuzzy match on dig (score ≥ τ_dig).

  3. If no acceptable match is found, run PaddleOCR and repeat step 2. If this also fails, set the prior to None.

  4. Convert the selected box to normalized coordinates [x1, y1, x2, y2] ∈ [0,1]^4, expand it by +10% in x and +15% in y, and clip to [0,1].

    Footnote 2: https://aws.amazon.com/textract/

    Footnote 3: NFKC, lowercasing, and replacing non-alphanumeric characters with white-space.

    To better understand how the pipeline behaves in practice, we record which rule in Step 2 produced the selected m...
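
The matching priority of step 2 and the box expansion of step 4 can be sketched in Python as below. This is an illustrative reading of the text, not the authors' implementation. Several details are assumptions: `dig` is taken to mean a digits-only view of the string, the fuzzy score is `difflib.SequenceMatcher.ratio()` (the text does not name a scorer), the thresholds `tau_text` and `tau_dig` use placeholder values, and the expansion is split evenly on both sides of the box. The function names `select_line` and `expand_box` are hypothetical, and the PaddleOCR fallback of step 3 is not shown.

```python
import unicodedata
from difflib import SequenceMatcher


def norm(text):
    """NFKC-normalize, lowercase, replace non-alphanumerics with spaces.
    Collapsing repeated spaces is an added assumption."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join("".join(c if c.isalnum() else " " for c in text).split())


def dig(text):
    """Digits-only view of the text (assumed meaning of dig)."""
    return "".join(c for c in unicodedata.normalize("NFKC", text) if c.isdigit())


def ratio(a, b):
    """Fuzzy similarity in [0, 1]; the choice of scorer is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def select_line(lines, answer, tau_text=0.8, tau_dig=0.9):
    """Return (line_index, rule_name) for the first match, or None.

    `lines` is a list of (text, box) pairs. Rules are tried in the priority
    order of step 2; the threshold defaults are placeholders.
    """
    a_norm, a_dig = norm(answer), dig(answer)
    rules = [
        ("exact_norm", lambda t: bool(a_norm) and norm(t) == a_norm),
        ("exact_dig", lambda t: bool(a_dig) and dig(t) == a_dig),
        ("substr_norm", lambda t: bool(a_norm) and a_norm in norm(t)),
        ("substr_dig", lambda t: bool(a_dig) and a_dig in dig(t)),
        ("fuzzy_norm", lambda t: bool(a_norm) and ratio(norm(t), a_norm) >= tau_text),
        ("fuzzy_dig", lambda t: bool(a_dig) and ratio(dig(t), a_dig) >= tau_dig),
    ]
    for name, rule in rules:
        for index, (text, _box) in enumerate(lines):
            if rule(text):
                return index, name
    return None


def expand_box(box, fx=0.10, fy=0.15):
    """Grow a normalized [x1, y1, x2, y2] box by fx of its width and fy of
    its height (half on each side, an assumption), then clip to [0, 1]."""
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * fx / 2, (y2 - y1) * fy / 2

    def clip(value):
        return min(1.0, max(0.0, value))

    return [clip(x1 - dx), clip(y1 - dy), clip(x2 + dx), clip(y2 + dy)]


ocr_lines = [
    ("Invoice No: 12345", [0.10, 0.10, 0.40, 0.13]),
    ("Total Due: $1,250.00", [0.10, 0.80, 0.45, 0.83]),
]
match = select_line(ocr_lines, "1,250.00")
prior = expand_box(ocr_lines[match[0]][1]) if match else None
```

Returning the rule name alongside the line index makes it straightforward to log which rule of step 2 produced each selected match.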