pith. sign in

arxiv: 2606.25343 · v2 · pith:ZBFQUHEPnew · submitted 2026-06-24 · 💻 cs.CV

Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity

Pith reviewed 2026-06-29 05:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords document retrievalvisual question answeringvisual homogeneityinvoice processingretrieval augmented generationmulti-document benchmarkembedding collapse
0
0 comments X

The pith

A hybrid text-visual retrieval method reaches 60 percent recall on collections of visually identical invoices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that existing multi-document benchmarks mix varied document types and therefore underestimate the retrieval difficulty created when thousands of records share the same visual layout. It presents Invoice Haystack, 1,500 anonymized invoices paired with 200 questions, engineered so that the mean pairwise cosine similarity reaches 0.73 and embeddings collapse. The authors introduce VL-RAG, a framework that fuses text and visual embeddings before applying a vision-language model verification filter, and report that this dual-stream approach raises Recall@1 to 60.0 percent on the Invoice Haystack-500 split. The improvement holds on two prior benchmarks as well. If the claim holds, retrieval systems for enterprise document stores would need to combine modalities rather than rely on any single embedding space.

Core claim

VL-RAG jointly uses text and visual embeddings to retrieve documents from visually homogeneous collections, then applies a VLM-based verification filter; this yields 60.0 percent Recall@1 on Invoice Haystack-500 and improves results on DocHaystack-1000 and InfoHaystack-1000, establishing dual-stream fusion as a consistently superior strategy.

What carries the argument

VL-RAG, the hybrid retrieval-augmented generation framework that fuses text and visual embeddings followed by a VLM verification filter.

If this is right

  • Dual-stream fusion raises Recall@1 by up to 13.5 absolute points on the new homogeneous benchmark.
  • The same fusion also improves retrieval on the heterogeneous DocHaystack-1000 and InfoHaystack-1000 sets.
  • A verification filter after embedding retrieval is required for precise identification when visual templates overlap.
  • Future retrieval systems for templated documents should combine modalities instead of using a single embedding stream.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that deliberately increase visual homogeneity may become the standard test for enterprise document systems.
  • The approach could extend to other high-volume templated domains such as medical forms or financial statements.
  • Scaling the verification filter to collections larger than 500 documents would test whether the reported gains persist.

Load-bearing premise

The mean pairwise cosine similarity of 0.73 in the benchmark accurately reflects the retrieval difficulty found in real enterprise repositories that contain thousands of records sharing identical visual templates.

What would settle it

Measure Recall@1 when the same method is run on an enterprise invoice archive of several thousand records that share one visual template; if performance falls below 50 percent the central performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.25343 by Basim Azam, Heethanjan Kanagalingam, Mokeeshan Vathanakumar, Naveed Akhtar, Sarah Monazam Erfani, Thenukan Pathmanathan.

Figure 1
Figure 1. Figure 1: Comparison of data domains across visual retrieval benchmarks. Unlike gen￾eral visual question answering datasets like RetVQA [56], and WebVQA [11], which rely on natural imagery, or Document Haystack [12], which aggregates diverse text documents, our proposed Invoice Haystack (right) targets the specialized domain of financial documentation. This focused scope challenges models to distinguish between visu… view at source ↗
Figure 2
Figure 2. Figure 2: Invoice Haystack Generation Pipeline. Our dataset is curated through a com￾prehensive four-stage workflow. Step 1 (Invoice Image Anonymization) employs Gemini 3 Pro to sanitize raw invoice images, ensuring sensitive information is removed. Step 2 (Complex Q&A Generation) utilizes Gemini 2.5 Flash to analyze the anonymized documents and synthesize a raw dataset of question-answer pairs. Step 3 (AI Fil￾terin… view at source ↗
Figure 3
Figure 3. Figure 3: (a) Distribution of question types in Invoice Haystack across eight semantic categories. (b) Spatial positioning of answers within the documents. enabling external correlation, and recognizable branding. Question quality is as￾sessed for discriminative power, clarity, and non-trivial reasoning requirements. Answer accuracy is confirmed by direct visual inspection of source images, en￾suring that answers ap… view at source ↗
Figure 4
Figure 4. Figure 4: Proposed VL-RAG architecture. The approach operates via two parallel streams. The Vision Stream (top) employs SigLIP and OpenCLIP to capture layout and structural features, while the Text Stream (bottom) uses OCR followed by BGE￾Large to encode semantic content. Similarity scores for all three encoders are averaged (SimAvg) to rank the corpus, and the top-m candidates undergo VLM-based binary verification … view at source ↗
Figure 5
Figure 5. Figure 5: Top-k selection ablation analysis for VQA on our Invoice Haystack benchmark. We evaluate the impact of different retrieval depths (k = 1, 3, 5) on VQA accuracy across all three corpus scales (Invoice Haystack-500, Invoice Haystack-1000, Invoice Haystack-1500) for four VLMs: Gemini-3-Flash-Preview, GPT-5.2, InternVL3-8B, and Qwen3-VL-8B-Instruct, all integrated with our VL-RAG framework. to 55.9% without th… view at source ↗
read the original abstract

Vision Language Models have achieved near-human performance on single-document Visual Question Answering, yet their effectiveness degrades significantly when retrieving information from large collections of visually homogeneous documents. Existing multi-document benchmarks aggregate diverse document types, creating artificial separation in embedding space that does not reflect enterprise document repositories where thousands of records share identical visual templates. We identify this as embedding collapse and introduce Invoice Haystack, a benchmark with 1,500 anonymized invoice images paired with 200 discriminative question-answer pairs, specifically designed to stress-test retrieval under strong visual homogeneity. Invoice Haystack exhibits a mean pairwise cosine similarity of 0.73, compared to 0.38 (DocHaystack) and 0.31 (InfoHaystack) in existing benchmarks, posing a fundamentally more challenging retrieval problem. Addressing the identified challenge, we propose VL-RAG, a hybrid retrieval-augmented generation framework that jointly leverages text and visual embeddings to harness the complementary strengths of both modalities, followed by a VLM-based verification filter for precise document identification. VL-RAG achieves 60.0\% Recall@1 on Invoice Haystack-500, outperforming existing state-of-the-art method by up to an absolute 13.5 percentage points. It further improves retrieval considerably on DocHaystack-1000 (77.1\% vs.\ 75.2\%) and InfoHaystack-1000 (84.5\% vs.\ 80.0\%), establishing the proposed dual-stream fusion as a consistently superior retrieval strategy across both homogeneous and heterogeneous document collections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Invoice Haystack, a benchmark of 1,500 anonymized invoice images paired with 200 QA pairs exhibiting high visual homogeneity (mean pairwise cosine similarity 0.73, vs. 0.38/0.31 in prior benchmarks), to address embedding collapse in enterprise multi-document VQA. It proposes VL-RAG, a hybrid RAG framework combining text and visual embeddings with VLM-based verification, reporting 60.0% Recall@1 on Invoice Haystack-500 (outperforming SOTA by up to 13.5pp) and gains on DocHaystack-1000 and InfoHaystack-1000.

Significance. If the empirical results hold under rigorous validation, the benchmark offers a targeted stress test for visually homogeneous documents that better reflects enterprise conditions than existing heterogeneous collections, while the dual-stream fusion in VL-RAG demonstrates a practical way to mitigate embedding collapse; the work would strengthen evaluation practices and retrieval strategies for template-heavy document corpora.

major comments (2)
  1. [Abstract] Abstract: the central claim that Invoice Haystack constitutes a 'fundamentally more challenging retrieval problem' rests on the mean pairwise cosine similarity of 0.73, yet no evidence is supplied that this value, the similarity distribution, or query-document alignment matches the scale, template repetition across years, or scan-quality variation of actual enterprise repositories containing thousands of records.
  2. [Abstract] Abstract: the headline performance numbers (60.0% Recall@1, +13.5pp) are presented without accompanying details on dataset splits, error bars, ablation studies, statistical significance tests, or the precise construction of the Invoice Haystack-500 subset, preventing verification of the reported gains.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, indicating revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Invoice Haystack constitutes a 'fundamentally more challenging retrieval problem' rests on the mean pairwise cosine similarity of 0.73, yet no evidence is supplied that this value, the similarity distribution, or query-document alignment matches the scale, template repetition across years, or scan-quality variation of actual enterprise repositories containing thousands of records.

    Authors: The 0.73 mean pairwise cosine similarity is offered as a quantitative indicator of greater visual homogeneity relative to DocHaystack (0.38) and InfoHaystack (0.31). Invoice Haystack is constructed from real anonymized invoices to reproduce the template repetition typical of enterprise settings. Direct matching to proprietary enterprise corpora at full scale is not feasible due to confidentiality. We will revise the abstract to replace 'fundamentally more challenging' with 'presents a more challenging retrieval problem under high visual homogeneity' and add an explicit limitations paragraph on the use of this proxy. revision: partial

  2. Referee: [Abstract] Abstract: the headline performance numbers (60.0% Recall@1, +13.5pp) are presented without accompanying details on dataset splits, error bars, ablation studies, statistical significance tests, or the precise construction of the Invoice Haystack-500 subset, preventing verification of the reported gains.

    Authors: The abstract is kept concise by design, but the full manuscript details benchmark construction and the Invoice Haystack-500 subset (a 500-document sample from the 1,500-image collection) in Section 3, experimental splits and protocol in Section 4, and ablations plus full results in Section 5. We will update the abstract to reference these sections and ensure error bars and significance tests appear in the results tables in the revision. revision: yes

standing simulated objections not resolved
  • Direct empirical evidence that the similarity distribution, query-document alignment, scale, multi-year template repetition, and scan-quality variation exactly match those of real enterprise repositories containing thousands of records cannot be supplied due to data-access and privacy constraints.

Circularity Check

0 steps flagged

No circularity; benchmark and method are independently constructed and evaluated

full rationale

The paper introduces Invoice Haystack as a new benchmark whose mean pairwise cosine similarity (0.73) is directly computed on the constructed 1,500-image collection and compared to prior benchmarks; VL-RAG is described as a hybrid of existing text and visual embeddings plus a VLM filter, with Recall@1 reported as an empirical evaluation on this benchmark and on DocHaystack/InfoHaystack. No equations, fitted parameters, self-citations, or ansatzes are present that would reduce the reported gains to a quantity defined by the authors' own prior choices or by construction. The derivation chain consists of standard retrieval evaluation and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that combining text and visual embeddings yields complementary signals and that a VLM verification step can reliably filter candidates; no free parameters, new entities, or non-standard axioms are introduced beyond standard embedding similarity.

axioms (1)
  • domain assumption Text and visual embeddings provide complementary information for document retrieval under visual homogeneity.
    Invoked when proposing the hybrid VL-RAG framework.

pith-pipeline@v0.9.1-grok · 5841 in / 1170 out tokens · 26010 ms · 2026-06-29T05:05:18.664028+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 23 canonical work pages · 9 internal anchors

  1. [1]

    ACM Computing Surveys54(6), 1–35 (2021)

    van der Aalst, J., Weske, M., Grünbauer, D.: Workflow patterns in invoice process- ing automation. ACM Computing Surveys54(6), 1–35 (2021)

  2. [2]

    Abaskohi, A., Gella, S., Carenini, G., Laradji, I.H.: Fm2ds: Few-shot multimodal multihop data synthesis with knowledge distillation for question answering (2025), https://arxiv.org/abs/2412.07030

  3. [3]

    com/ocr-sdk/(2024), accessed: 2024-02-23

    ABBYY: ABBYY FineReader Engine: Advanced OCR SDK.https://www.abbyy. com/ocr-sdk/(2024), accessed: 2024-02-23

  4. [4]

    In: NeurIPS (2022)

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: A visual language model for few-shot learning. In: NeurIPS (2022)

  5. [5]

    Information Management Journal47(3), 28–41 (2023)

    Anderson, M., Davis, P., Wilson, K.: Survey of enterprise document management practices. Information Management Journal47(3), 28–41 (2023)

  6. [6]

    Anthropic: The claude 3 model family: Opus, sonnet, haiku. Tech. rep., Anthropic (2024)

  7. [7]

    Anthropic: Claude 3.7 sonnet model card. Tech. rep., Anthropic (2025),https: //www.anthropic.com/news/claude-3-7-sonnet

  8. [8]

    In: The Twelfth International Con- ference on Learning Representations (2023)

    Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H.: Self-rag: Learning to retrieve, generate, and critique through self-reflection. In: The Twelfth International Con- ference on Learning Representations (2023)

  9. [9]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  10. [10]

    Nougat: Neural Optical Understanding for Academic Documents

    Blecher, L., Cucurull, G., Scialom, T., Stojnic, R.: Nougat: Neural optical under- standing for academic documents. arXiv preprint arXiv:2308.13418 (2023) 16 Heethanjan et al

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., Bisk, Y.: Webqa: Multihop and multimodal qa. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16495–16504 (2022)

  12. [12]

    Chen, J., Xu, D., Fei, J., Feng, C.M., Elhoseiny, M.: Document haystacks: Vision- languagereasoningoverpilesof1000+documents.In:ProceedingsoftheComputer Vision and Pattern Recognition Conference. pp. 24817–24826 (2025)

  13. [13]

    In: Pro- ceedings of the 2022 Conference on Empirical Methods in Natural Language Pro- cessing

    Chen, W., Hu, H., Chen, X., Verga, P., Cohen, W.: Murag: Multimodal retrieval- augmented generator for open question answering over images and text. In: Pro- ceedings of the 2022 Conference on Empirical Methods in Natural Language Pro- cessing. pp. 5558–5570 (2022)

  14. [14]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023

    Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). p. 2818–2829. IEEE (Jun 2023).https://doi. org/10.1109/cvpr52729.2023.00276,http://dx.doi.o...

  15. [15]

    Advances in neural information processing systems36, 49250–49267 (2023)

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

  16. [16]

    arXiv preprint arXiv:2203.06482 (2022)

    Davis, L., Chen, S., Soricut, V., Bansal, M.: Visually-rich document understanding: Challenges and opportunities. arXiv preprint arXiv:2203.06482 (2022)

  17. [17]

    DeepMind, G.: Gemini 2.5 flash technical report. Tech. rep., Google DeepMind (2025),https://deepmind.google/models/gemini/flash

  18. [18]

    In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

  19. [19]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  20. [20]

    Du, H., Zhang, J., Nan, G., Deng, W., Chen, Z., Zhang, C., Xiao, W., Huang, S., Pan, Y., Qi, T., Leng, S.: From easy to hard: The mir benchmark for progressive interleaved multi-image reasoning (2025),https://arxiv.org/abs/2509.17040

  21. [21]

    unspecified

    Du, Y., Tian, M., Ronanki, S., Rongali, S., Bodapati, S., Galstyan, A., Wells, A., Schwartz, R., Huerta, E.A., Peng, H.: Context length alone hurts llm performance despite perfect retrieval. arXiv preprint arXiv:2510.05381 (2025)

  22. [22]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., Colombo, P.: Colpali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449 (2024)

  23. [23]

    Google DeepMind: Gemini (2026),https://deepmind.google/models/gemini/, accessed: 2026-03-04

  24. [24]

    Google DeepMind: Gemini 3.1 pro model card (February 2026),https : / / deepmind.google/models/model-cards/gemini-3-1-pro/, accessed: 2026-03-04

  25. [25]

    Advances in Neural Information Processing Systems34, 39–50 (2021)

    Gu, J., Kuen, J., Morariu, V.I., Zhao, H., Jain, R., Barmpalios, N., Nenkova, A., Sun, T.: Unidoc: Unified pretraining framework for document understanding. Advances in Neural Information Processing Systems34, 39–50 (2021)

  26. [26]

    In: International conference on machine learning

    Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval augmented language model pre-training. In: International conference on machine learning. pp. 3929–

  27. [27]

    PMLR (2020) Invoice Haystack 17

  28. [28]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.W., Sun, Y., Schmid, C., Ross, D.A., Fathi, A.: Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23369–23379 (2023)

  29. [29]

    In: Proceedings of ACM Multimedia (2022)

    Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: Layoutlmv3: Pre-training for document ai with unified text and image masking. In: Proceedings of ACM Multimedia (2022)

  30. [30]

    In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume

    Izacard, G., Grave, E.: Leveraging passage retrieval with generative models for open domain question answering. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume. pp. 874–880 (2021)

  31. [31]

    Journal of Machine Learning Research24(251), 1–43 (2023)

    Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi- Yu, J., Joulin, A., Riedel, S., Grave, E.: Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research24(251), 1–43 (2023)

  32. [32]

    In: EMNLP (2020)

    Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. In: EMNLP (2020)

  33. [33]

    Kashyap, S., Shirai, S., Mihindukulasooriya, N., Samulowitz, H.: Structtext: A synthetic table-to-text approach for benchmark generation with multi-dimensional evaluation (2025),https://arxiv.org/abs/2507.21340

  34. [34]

    In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    Katti, A.R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., Faddoul, J.B.: Chargrid: Towards understanding 2d documents. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 4459–4469 (2018)

  35. [35]

    Kazemi, M., Dikkala, N., Anand, A., Devic, P., Dasgupta, I., Liu, F., Fatemi, B., Awasthi, P., Guo, D., Gollapudi, S., Qureshi, A.: Remi: A dataset for reasoning with multiple images (2024),https://arxiv.org/abs/2406.09175

  36. [36]

    In: SIGIR (2020)

    Khattab, O., Zaharia, M.: Colbert: Efficient and effective passage search via con- textualized late interaction over bert. In: SIGIR (2020)

  37. [37]

    In: European Conference on Computer Vision

    Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Ocr-free document understanding transformer. In: European Conference on Computer Vision. pp. 498–517. Springer (2022)

  38. [38]

    arXiv preprint arXiv:2412.08802 (2024)

    Koukounas, A., Mastrapas, G., Eslami, S., Wang, B., Akram, M.K., Günther, M., Mohr, I., Sturua, S., Wang, N., Xiao, H.: jina-clip-v2: Multilingual multimodal embeddings for text and images. arXiv preprint arXiv:2412.08802 (2024)

  39. [39]

    Transactions of the Association for Compu- tational Linguistics7, 453–466 (2019)

    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al.: Natural questions: a bench- mark for question answering research. Transactions of the Association for Compu- tational Linguistics7, 453–466 (2019)

  40. [40]

    In: International Conference on Machine Learning

    Lee, K., Joshi, M., Turc, I.R., Hu, H., Liu, F., Eisenschlos, J.M., Khandelwal, U., Shaw, P., Chang, M.W., Toutanova, K.: Pix2struct: Screenshot parsing as pretrain- ing for visual language understanding. In: International Conference on Machine Learning. pp. 18893–18912. PMLR (2023)

  41. [41]

    In: Proceedings of the 58th annual meeting of the association for computational linguistics

    Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoy- anov, V., Zettlemoyer, L.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 7871– 7880 (2020)

  42. [42]

    knowledge-intensive nlp tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for 18 Heethanjan et al. knowledge-intensive nlp tasks. Advances in neural information processing systems 33, 9459–9474 (2020)

  43. [43]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  44. [44]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  45. [45]

    Advances in neural information processing systems36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

  46. [46]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  47. [47]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer:Hierarchicalvisiontransformerusingshiftedwindows.In:Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  48. [48]

    Advances in neural information processing systems32(2019)

    Lu,J.,Batra,D.,Parikh,D.,Lee,S.:Vilbert:Pretrainingtask-agnosticvisiolinguis- tic representations for vision-and-language tasks. Advances in neural information processing systems32(2019)

  49. [49]

    In: Findings of the association for computational linguistics: ACL 2022

    Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022)

  50. [50]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Info- graphicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1697–1706 (2022)

  51. [51]

    In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

    Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021)

  52. [52]

    arXiv preprint arXiv:2504.08748 (2025)

    Mei, L., Mo, S., Yang, Z., Chen, C.: A survey of multimodal retrieval-augmented generation. arXiv preprint arXiv:2504.08748 (2025)

  53. [53]

    Nomic Embed: Training a Reproducible Long Context Text Embedder

    Nussbaum, Z., Morris, J.X., Duderstadt, B., Mulyar, A.: Nomic embed: Training a reproducible long context text embedder. arXiv preprint arXiv:2402.01613 (2024)

  54. [54]

    OpenAI: Gpt-4o: Multimodal capabilities with enhanced reasoning. Tech. rep., OpenAI (2024)

  55. [55]

    OpenAI API Documentation (2026),https://developers

    OpenAI: Models. OpenAI API Documentation (2026),https://developers. openai.com/api/docs/guides/latest-model, accessed: 2026-03-04

  56. [56]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  57. [57]

    arXiv preprint arXiv:2306.16713 (2023)

    Penamakuri, A.S., Gupta, M., Gupta, M.D., Mishra, A.: Answer mining from a pool of images: Towards retrieval-based visual question answering. arXiv preprint arXiv:2306.16713 (2023)

  58. [58]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  59. [59]

    Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond, vol. 4. Now Publishers Inc (2009) Invoice Haystack 19

  60. [60]

    In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR)

    Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR). vol. 1, pp. 1162–1167. IEEE (2017)

  61. [61]

    Transactions of the Association for Computational Linguistics10, 376–392 (2022)

    Shen, Z., Lo, K., Wang, L.L., Kuehl, B., Weld, D.S., Downey, D.: VILA: Improv- ing structured content extraction from scientific PDFs using visual layout groups. Transactions of the Association for Computational Linguistics10, 376–392 (2022). https://doi.org/10.1162/tacl_a_00466,https://aclanthology.org/2022. tacl-1.22/

  62. [62]

    In: Document Engineering (2023)

    Smith, B., Johnson, R.: Template diversity in enterprise invoice repositories: An empirical study. In: Document Engineering (2023)

  63. [63]

    In: Ninth international con- ference on document analysis and recognition (ICDAR 2007)

    Smith, R.: An overview of the tesseract ocr engine. In: Ninth international con- ference on document analysis and recognition (ICDAR 2007). vol. 2, pp. 629–633. IEEE (2007)

  64. [64]

    In: ICLR (2021)

    Talmor, A., Yoran, O., Catav, A., Lahav, D., Wang, Y., Asai, A., Ilharco, G., Hajishirzi, H., Berant, J.: Multimodalqa: Complex question answering over text, tables and images. In: ICLR (2021)

  65. [65]

    In: Proceedings of the AAAI Conference on Artificial Intel- ligence

    Tanaka, R., Nishida, K., Yoshida, S.: Visualmrc: Machine reading comprehension on document images. In: Proceedings of the AAAI Conference on Artificial Intel- ligence. vol. 35, pp. 13878–13888 (2021)

  66. [66]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C., Bansal, M.: Unifying vision, text, and layout for universal document processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 19254–19264 (2023)

  67. [67]

    Advances in neural information pro- cessing systems30(2017)

    Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30(2017)

  68. [68]

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...

  69. [69]

    In: FinNLP Workshop (2023)

    Wang, Z., Huang, M., Xu, Y., Chen, H.: Automated financial document processing: Challenges and opportunities. In: FinNLP Workshop (2023)

  70. [70]

    Wu, T.H., Biamby, G., Quenum, J., Gupta, R., Gonzalez, J.E., Darrell, T., Chan, D.M.: Visual haystacks: A vision-centric needle-in-a-haystack benchmark (2025), https://arxiv.org/abs/2407.13766

  71. [71]

    C-Pack: Packed Resources For General Chinese Embeddings

    Xiao, F., Luan, H., Zhao, M., Li, J., Wu, L., Xu, J.: Bge: A family of general text embeddings. arXiv preprint arXiv:2309.07597 (2023)

  72. [72]

    In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining

    Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: Layoutlm: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 1192–1200 (2020) 20 Heethanjan et al

  73. [73]

    In: Proceedings of the Annual Meeting of the Asso- ciation for Computational Linguistics (ACL) (2021)

    Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., Zhou, L.: Layoutlmv2: Multi-modal pre-training for visually- rich document understanding. In: Proceedings of the Annual Meeting of the Asso- ciation for Computational Linguistics (ACL) (2021)

  74. [74]

    Yan, S.Q., Gu, J.C., Zhu, Y., Ling, Z.H.: Corrective retrieval augmented generation (2024)

  75. [75]

    arXiv preprint arXiv:2211.12561 (2022)

    Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., Yih, W.t.: Retrieval-augmented multimodal language model- ing. arXiv preprint arXiv:2211.12561 (2022)

  76. [76]

    Ye, J., Hu, A., Xu, H., Ye, Q., Yan, M., Dan, Y., Zhao, C., Xu, G., Li, C., Tian, J., Qi, Q., Zhang, J., Huang, F.: mplug-docowl: Modularized multimodal large language model for document understanding (2023),https://arxiv.org/abs/ 2307.02499

  77. [77]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language im- age pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

  78. [78]

    In: 2019 International conference on document analysis and recog- nition (ICDAR)

    Zhong, X., Tang, J., Yepes, A.J.: Publaynet: largest dataset ever for document layout analysis. In: 2019 International conference on document analysis and recog- nition (ICDAR). pp. 1015–1022. IEEE (2019)

  79. [79]

    In: ICML (2024)

    Zhou, Y., Wang, X., Kan, M.: Understanding embedding collapse in vision- language models. In: ICML (2024)