LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

Yanchu Guan; Yifan Zhu; Yue Lu; Yu Mi; Zhixuan Chu

arxiv: 2605.22829 · v1 · pith:D25LTXGZnew · submitted 2026-04-18 · 💻 cs.IR · cs.AI

LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

Yifan Zhu , Yu Mi , Yue Lu , Yanchu Guan , Zhixuan Chu This is my paper

Pith reviewed 2026-05-25 00:26 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords multimodal RAGlayout segmentationfine-grained retrievaldocument understandingblock-level retrievalretrieval-augmented generationLFDocQAmultimodal documents

0 comments

The pith

LFRAG replaces page-level retrieval in multimodal RAG with block-level units created by layout segmentation and fused via cross-attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that coarse page-level retrieval in multimodal RAG misses fine semantic and layout details in documents and injects redundant context into generation. LFRAG addresses this by segmenting documents into layout-based blocks, encoding them with a semantic-layout fusion module, and retrieving at block level with late interaction. The resulting alignment improves answer accuracy and sharply cuts token use downstream. To support evaluation they release LFDocQA, a benchmark with block-level labels across document types. Experiments show the method reaches state-of-the-art retrieval while delivering a 7.20 percent accuracy gain and 73.07 percent token reduction versus prior baselines.

Core claim

LFRAG advances multimodal RAG from page-level to block-level retrieval by first applying layout segmentation to form semantically coherent fine-grained units, then feeding those units to a semantic-layout fusion encoder that merges local semantics with global context through cross-attention, and finally performing block-level late interaction retrieval to achieve precise query-content alignment while discarding irrelevant material for generation.

What carries the argument

layout segmentation into semantically coherent blocks combined with a semantic-layout fusion encoder and block-level late interaction retrieval

If this is right

LFRAG reaches state-of-the-art retrieval performance on LFDocQA.
Answer accuracy rises 7.20 percent over the strongest baseline.
Token consumption in generation drops 73.07 percent.
Query-content alignment becomes more precise by excluding irrelevant blocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same segmentation-plus-fusion pattern could be tested on non-question-answering tasks such as document summarization or classification.
Dynamic adjustment of block boundaries according to query type might further reduce context size.
LFDocQA supplies a reusable testbed for any future fine-grained multimodal retrieval method.

Load-bearing premise

Layout segmentation reliably yields semantically coherent blocks that improve alignment without losing critical context.

What would settle it

If page-level retrieval matches or exceeds LFRAG accuracy and token efficiency on LFDocQA or an equivalent block-annotated benchmark, the claimed advantage of the block-level approach would be refuted.

Figures

Figures reproduced from arXiv: 2605.22829 by Yanchu Guan, Yifan Zhu, Yue Lu, Yu Mi, Zhixuan Chu.

**Figure 2.** Figure 2: The framework of LFRAG. Given a query, LFRAG retrieves relevant information from a document corpus by [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Construction of LFDocQA Benchmark. LFDocQA consists of two subsets, LF-Docmatix and LFPaperTab, built via a semi-automatic pipeline to provide fine-grained annotations for retrieval and QA. The retrieved relevant blocks are denoted as B ∗ q = {B∗ 1 , B∗ 2 , . . . , B∗ k }, where k is the number of retrieved blocks. Subsequently, the query q and the retrieved relevant blocks B ∗ q are fed into a generatio… view at source ↗

**Figure 4.** Figure 4: Comparison of page-level retrieval performance [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Impact of top-k on generation performance and token usage on LFDocQA. LFRAG achieves better answer quality with significantly fewer tokens compared to pagelevel retrieval baselines. contributes to the overall effectiveness. In particular, removing block aggregation leads to a noticeable drop in both NDCG@k (N@k) and Recall@k (R@k), indicating that constructing semantically coherent units is essential for… view at source ↗

**Figure 7.** Figure 7: Prompt for Downstream Answer Generation. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 6.** Figure 6: Prompt for automatic relevant block annotation in [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 8.** Figure 8: Prompt for grounded QA generation in LF-PaperTab. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 10.** Figure 10: Prompt for LLM-based QA Answer Evaluation. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 9.** Figure 9: Prompt for query filtering and rewriting. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 11.** Figure 11: Qualitative examples of LFRAG on LFDocQA. Each case shows the query, top-3 retrieved blocks retrieved by [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Representative examples from LFDocQA, including document pages, layout segmentation, aggregated semantic [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

read the original abstract

Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimodal RAG systems predominantly rely on coarse-grained page-level retrieval, which fails to capture fine-grained semantic and layout structures in visually rich documents, thereby compromising retrieval accuracy and leading to redundant context in downstream tasks. To address these issues, we propose Layout-oriented Fine-grained Retrieval-Augmented Generation (LFRAG), a novel framework that advances multimodal RAG from page-level to block-level retrieval. We perform layout segmentation to construct semantically coherent fine-grained retrieval units and design a semantic-layout fusion encoder that integrates local semantics with global context via cross-attention. With block-level late interaction retrieval, LFRAG enables precise query-content alignment and reduces irrelevant content for downstream generation. To enable rigorous evaluation, we construct LFDocQA, a large-scale benchmark with block-level annotations spanning diverse document types, designed to assess both multimodal document retrieval and question answering with greater granularity than existing datasets. Extensive experiments on LFDocQA demonstrate that LFRAG achieves state-of-the-art performance on retrieval tasks, outperforms the best baseline by 7.20% in answer accuracy, and reduces token consumption by 73.07% in generation tasks, confirming LFRAG as an accurate and efficient framework for multimodal RAG over visually rich documents. Our code and datasets will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LFRAG pushes multimodal RAG to block-level retrieval with layout fusion and a new benchmark, but the headline gains sit entirely on that self-built dataset with no visible details on annotation or baseline fairness.

read the letter

The main takeaway is that this paper takes the practical step of moving from page-level to block-level retrieval in multimodal document RAG. It segments documents by layout into finer units, adds a cross-attention fusion encoder to mix local semantics with global layout context, and uses late interaction for retrieval. They also release LFDocQA, a benchmark with block-level labels across document types, to test both retrieval and QA at that granularity. The efficiency angle—cutting tokens while keeping accuracy—is the part that could matter for real systems handling reports or slides. That part of the framing is clear and addresses a common complaint about bloated context in current RAG pipelines. The reported numbers are a 7.2% accuracy lift over the best baseline and a 73% token drop in generation, plus SOTA retrieval on their test set. What is missing is any description of how the block annotations were produced, whether inter-annotator checks were done, or how the page-level baselines were adapted to the same units. Because every result lives on LFDocQA, it is hard to separate the contribution of the layout fusion from possible benchmark artifacts that reward block-aware methods. The abstract does not address this, so the central claim stays unverified for now. This is aimed at groups already working on document RAG or multimodal retrieval who need finer control over context. A reader in that area would find the framework description useful as a starting point even before the numbers are stress-tested. It is worth sending to peer review so the experimental section can be checked for fair baseline handling and external validation.

Referee Report

2 major / 2 minor

Summary. The paper proposes LFRAG, a multimodal RAG framework that replaces page-level retrieval with block-level retrieval obtained via layout segmentation. It introduces a semantic-layout fusion encoder using cross-attention and block-level late interaction retrieval, constructs the LFDocQA benchmark with block-level annotations across document types, and reports SOTA retrieval performance, a 7.20% gain in answer accuracy over the best baseline, and a 73.07% reduction in token consumption for generation on LFDocQA.

Significance. If the reported gains prove robust under fair baseline adaptation, the shift to layout-aware block retrieval could improve precision and efficiency in multimodal RAG for visually rich documents; the new LFDocQA benchmark would also supply a useful testbed for fine-grained evaluation.

major comments (2)

[§4, §4.1] §4 (Experiments) and §4.1 (Benchmark Construction): the headline deltas (+7.20% answer accuracy, -73.07% tokens) are measured exclusively on the self-constructed LFDocQA; the manuscript must specify the annotation protocol, inter-annotator agreement statistics, and whether page-level baselines were given equivalent block units or remained at page granularity, as this directly determines whether the gains demonstrate superiority of the layout-oriented approach rather than benchmark artifact.
[§4.2] §4.2 (Baseline Comparison): without explicit confirmation that all baselines received the same block-level segmentation and retrieval units as LFRAG, the cross-method comparison on LFDocQA cannot isolate the contribution of the semantic-layout fusion encoder and late-interaction retrieval from the granularity change itself.

minor comments (2)

The abstract states that code and datasets will be released soon; the final version should include a concrete release URL or repository link.
[§3.2] Notation for the cross-attention module in the fusion encoder should be defined explicitly (e.g., query/key/value projections) rather than left to the figure caption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and rigor.

read point-by-point responses

Referee: [§4, §4.1] §4 (Experiments) and §4.1 (Benchmark Construction): the headline deltas (+7.20% answer accuracy, -73.07% tokens) are measured exclusively on the self-constructed LFDocQA; the manuscript must specify the annotation protocol, inter-annotator agreement statistics, and whether page-level baselines were given equivalent block units or remained at page granularity, as this directly determines whether the gains demonstrate superiority of the layout-oriented approach rather than benchmark artifact.

Authors: We agree that additional details on benchmark construction are required for transparency. In the revised manuscript we will expand §4.1 with a full description of the annotation protocol (including segmentation guidelines, annotation criteria, and quality-control steps) and report inter-annotator agreement statistics. We will also explicitly confirm that page-level baselines were adapted to the identical block-level units produced by the layout segmentation, thereby isolating the contribution of LFRAG’s components from the granularity change. revision: yes
Referee: [§4.2] §4.2 (Baseline Comparison): without explicit confirmation that all baselines received the same block-level segmentation and retrieval units as LFRAG, the cross-method comparison on LFDocQA cannot isolate the contribution of the semantic-layout fusion encoder and late-interaction retrieval from the granularity change itself.

Authors: We will revise §4.2 to state explicitly that every baseline was supplied with the same block-level segmentation and retrieval units used by LFRAG. This clarification will demonstrate that observed gains arise from the semantic-layout fusion encoder and block-level late-interaction retrieval rather than from differences in retrieval granularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation of new framework and benchmark

full rationale

The provided abstract and description introduce LFRAG as a new framework using layout segmentation and a semantic-layout fusion encoder, plus the LFDocQA benchmark with block-level annotations. Performance claims (SOTA retrieval, +7.20% accuracy, -73.07% tokens) are presented as results of experiments on this benchmark. No equations, parameter fitting presented as prediction, self-citations as load-bearing premises, or derivations that reduce to inputs by construction appear. The chain is empirical comparison rather than tautological. This matches the default expectation for non-circular papers; the skeptic concern addresses benchmark fairness but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; limited visibility into parameters or assumptions. The core premise that layout segmentation yields coherent semantic units is treated as a domain assumption.

axioms (1)

domain assumption Layout segmentation produces semantically coherent fine-grained retrieval units
Invoked as the foundation for moving from page-level to block-level retrieval.

pith-pipeline@v0.9.0 · 5797 in / 1091 out tokens · 20784 ms · 2026-05-25T00:26:58.500932+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.ArXiv, abs/2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Bench- marking large language models in retrieval-augmented gen- eration

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Bench- marking large language models in retrieval-augmented gen- eration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024

work page 2024
[4]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 4(5), 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Document haystacks: Vision-language reasoning over piles of 1000+ documents

Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, and Mo- hamed Elhoseiny. Document haystacks: Vision-language reasoning over piles of 1000+ documents. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24817–24826, 2025

work page 2025
[6]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropoli- tansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused sum- marization.arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C ´eline Hudelot, and Pierre Colombo. Col- pali: Efficient document retrieval with vision language mod- els.arXiv preprint arXiv:2407.01449, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large lan- guage models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Retrieval augmented language model pre- training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre- training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

work page 2020
[10]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

work page 2022
[11]

A Survey on Retrieval-Augmented Text Generation for Large Language Models

Yizheng Huang and Jimmy Huang. A survey on retrieval- augmented text generation for large language models.arXiv preprint arXiv:2404.10981, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Uda: A bench- mark suite for retrieval augmented generation in real-world document analysis.Advances in Neural Information Pro- cessing Systems, 37:67200–67217, 2024

Yulong Hui, Yao Lu, and Huanchen Zhang. Uda: A bench- mark suite for retrieval augmented generation in real-world document analysis.Advances in Neural Information Pro- cessing Systems, 37:67200–67217, 2024

work page 2024
[13]

Colbert: Efficient and effective passage search via contextualized late interaction over bert

Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. InProceedings of the 43rd International ACM SI- GIR conference on research and development in Information Retrieval, pages 39–48, 2020

work page 2020
[14]

Building and better understanding vision- language models: insights and future directions., 2024

Hugo Laurenc ¸on, Andr´es Marafioti, Victor Sanh, and L ´eo Tronchon. Building and better understanding vision- language models: insights and future directions., 2024

work page 2024
[15]

Nv- embed: Improved techniques for training llms as generalist embedding models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv- embed: Improved techniques for training llms as generalist embedding models. InThe Thirteenth International Confer- ence on Learning Representations, 2024

work page 2024
[16]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020
[17]

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, 2024

work page 2024
[18]

Regionrag: Region-level retrieval- augmented generation for visual document understanding

Yinglu Li, Zhiying Lu, Zhihang Liu, Yiwei Sun, Chuanbin Liu, and Hongtao Xie. Regionrag: Region-level retrieval- augmented generation for visual document understanding. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 40, pages 6662–6670, 2026

work page 2026
[19]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

work page 2004
[20]

Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language mod- els.ACM Transactions on Information Systems, 43(2):1–32, 2025

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and En- hong Chen. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language mod- els.ACM Transactions on Information Systems, 43(2):1–32, 2025

work page 2025
[21]

Unifying multimodal retrieval via document screenshot embedding

Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. InProceedings of the 2024 Confer- ence on Empirical Methods in Natural Language Processing, pages 6492–6505, 2024

work page 2024
[22]

Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025

Quentin Mac ´e, Ant´onio Loison, and Manuel Faysse. Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025

work page arXiv 2025
[23]

Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning. InFind- ings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022

work page 2022
[24]

Infographicvqa

Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1697–1706, 2022

work page 2022
[25]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceed- ings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

work page 2021
[26]

Now Publishers Inc, 2009

Stephen Robertson and Hugo Zaragoza.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

work page 2009
[27]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuo- fan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a com- prehensive dataset and benchmark for chain-of-thought rea- soning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

work page 2024
[28]

An overview of the tesseract ocr engine

Ray Smith. An overview of the tesseract ocr engine. InNinth international conference on document analysis and recogni- tion (ICDAR 2007), volume 2, pages 629–633. IEEE, 2007

work page 2007
[29]

Garage: A benchmark with grounding annotations for rag evaluation

Ionut Teodor Sorodoc, Leonardo FR Ribeiro, Rexhina Blloshmi, Christopher Davis, and Adri`a de Gispert. Garage: A benchmark with grounding annotations for rag evaluation. InFindings of the Association for Computational Linguis- tics: ACL 2025, pages 17030–17049, 2025

work page 2025
[30]

Visdom: Multi-document qa with visually rich elements using multi- modal retrieval-augmented generation

Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A Rossi, and Dinesh Manocha. Visdom: Multi-document qa with visually rich elements using multi- modal retrieval-augmented generation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies...

work page 2025
[31]

Vdocrag: Retrieval- augmented generation over visually-rich documents

Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. Vdocrag: Retrieval- augmented generation over visually-rich documents. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 24827–24837, 2025

work page 2025
[32]

Slidevqa: A dataset for document visual question answering on multiple images

Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. Slidevqa: A dataset for document visual question answering on multiple images. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13636–13645, 2023

work page 2023
[33]

Scan: Semantic document layout analysis for textual and visual retrieval- augmented generation

Nobuhiro Ueda, Yuyang Dong, Kriszti ´an Boros, Daiki Ito, Takuya Sera, and Masafumi Oyamada. Scan: Semantic document layout analysis for textual and visual retrieval- augmented generation. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1618–1637, 2026

work page 2026
[34]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Vidorag: Visual document retrieval-augmented generation via dynamic iter- ative reasoning agents

Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shi- hang Wang, Pengjun Xie, and Feng Zhao. Vidorag: Visual document retrieval-augmented generation via dynamic iter- ative reasoning agents. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9124–9145, 2025

work page 2025
[36]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Qwen2.5 technical report, 2025

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2025

work page 2025
[38]

Crag-comprehensive rag bench- mark.Advances in Neural Information Processing Systems, 37:10470–10490, 2024

Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al. Crag-comprehensive rag bench- mark.Advances in Neural Information Processing Systems, 37:10470–10490, 2024

work page 2024
[39]

Visrag: Vision-based retrieval-augmented genera- tion on multi-modality documents

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented genera- tion on multi-modality documents. InThe Thirteenth Inter- national Conference on Learning Representations, 2024

work page 2024
[40]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

work page 2023
[41]

relevant_bbox_ids

Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception.arXiv preprint arXiv:2410.12628, 2024. A. More Details on Datasets This section provides additional details on the construc- tion of the LFDocQA benchmark used in our experiments, i...

work page arXiv 2024
[42]

These are high-value for RAG training

**Prioritize Relational Questions**: Strive to create questions that require synthesizing information from **multiple (two or three or four) bounding boxes** to answer. These are high-value for RAG training

work page
[43]

**Allow High-Quality Single-Box Questions**: You may also include questions that can be answered from a single box, but only if the question requires summarization, inference, or understanding the text's purpose (not just simple text extraction)

work page
[44]

Introduction

**Leverage Layout and Semantic Relationships**: Generate questions that require reasoning about the relationship between different content types based on their visual layout and semantic meaning. For example: * Ask how a figure is explained by its nearby caption. * Ask what the text under a specific title (e.g., "Introduction") describes. * Ask to summari...

work page
[45]

Your output JSON must identify the relevant ID numbers (the integers) required to answer the question in the`relevant_bbox_ids`field

**Identify Evidence IDs**: Each box is labeled in the format`classname_ID`. Your output JSON must identify the relevant ID numbers (the integers) required to answer the question in the`relevant_bbox_ids`field

work page
[46]

They must not contain the`classname_ID`labels (e.g., 'title_4', 'plain text_1')

**Do Not Expose Annotation Labels**: The`question`and`answer`strings you generate must be natural and sound as if a human is asking about the document. They must not contain the`classname_ID`labels (e.g., 'title_4', 'plain text_1'). These labels are only for your internal use to populate the`relevant_bbox_ids`list

work page
[47]

How many tables are in this paper?

**Avoid General/Retrieval-Unfriendly Questions**: Do not ask overly broad questions about the document's structure or simple facts that are not suitable for a retrieval task. For example, avoid questions like "How many tables are in this paper?" or "What is the title of section 2?". Focus on the specific content and its meaning

work page
[48]

Do not include any explanatory text before or after the JSON

**JSON Output Format**: Your response must be a strictly formatted JSON object. Do not include any explanatory text before or after the JSON

work page
[49]

QAs": [ {

**JSON Structure**: The JSON object must contain a single key "QAs", which holds a list of objects. Each object in the list must contain three keys:`question`(string),`answer`(string), and`relevant_bbox_ids`(a list of integers). Example of the required output format: ```json { "QAs": [ { "question": "What are the affiliation and email address of the autho...

work page
[50]

If the question is already self-contained and document-agnostic, keep it unchanged

work page
[51]

the document

If the question is ambiguous or context-dependent (e.g., uses "the document", "the product"), use the relevant bboxes to find specific information (e.g., unique names, identifiers) and rewrite it into a standalone, document-independent question. *When rewriting, keep the new question as concise as possible while preserving the original meaning.*

work page
[52]

question

If it cannot be rewritten into a clear standalone form, discard it (set question to null). - **Step 2: Output your final judgment in pure JSON.** ### Output Format: You must respond in pure JSON, with the following fields: { "question": "rewritten or original question text, or null if discarded", "reason_for_question": "Explain briefly why you kept, rewro...

work page
[53]

For numbers, dates, and proper nouns: if the main conclusion is correct but formatting differs slightly (e.g., 09/01 vs September 1), be lenient

work page
[54]

For long answers: if the core conclusion is correct but contains a small amount of noise, do not over-penalize

work page
[55]

not mentioned/cannot be determined

If the reference answer indicates "not mentioned/cannot be determined", then fabricated specific facts should receive a low score (0-2). Return strict JSON only. Do not output anything else: {{"score": <integer 0-5>, "reason": "one short English reason"}} [Reference Answer] {ground_truth} [Model Answer] {prediction} Figure 10: Prompt for LLM-based QA Answ...

work page

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.ArXiv, abs/2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Bench- marking large language models in retrieval-augmented gen- eration

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Bench- marking large language models in retrieval-augmented gen- eration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024

work page 2024

[4] [4]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 4(5), 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Document haystacks: Vision-language reasoning over piles of 1000+ documents

Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, and Mo- hamed Elhoseiny. Document haystacks: Vision-language reasoning over piles of 1000+ documents. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24817–24826, 2025

work page 2025

[6] [6]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropoli- tansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused sum- marization.arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C ´eline Hudelot, and Pierre Colombo. Col- pali: Efficient document retrieval with vision language mod- els.arXiv preprint arXiv:2407.01449, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large lan- guage models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Retrieval augmented language model pre- training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre- training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

work page 2020

[10] [10]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

work page 2022

[11] [11]

A Survey on Retrieval-Augmented Text Generation for Large Language Models

Yizheng Huang and Jimmy Huang. A survey on retrieval- augmented text generation for large language models.arXiv preprint arXiv:2404.10981, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Uda: A bench- mark suite for retrieval augmented generation in real-world document analysis.Advances in Neural Information Pro- cessing Systems, 37:67200–67217, 2024

Yulong Hui, Yao Lu, and Huanchen Zhang. Uda: A bench- mark suite for retrieval augmented generation in real-world document analysis.Advances in Neural Information Pro- cessing Systems, 37:67200–67217, 2024

work page 2024

[13] [13]

Colbert: Efficient and effective passage search via contextualized late interaction over bert

Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. InProceedings of the 43rd International ACM SI- GIR conference on research and development in Information Retrieval, pages 39–48, 2020

work page 2020

[14] [14]

Building and better understanding vision- language models: insights and future directions., 2024

Hugo Laurenc ¸on, Andr´es Marafioti, Victor Sanh, and L ´eo Tronchon. Building and better understanding vision- language models: insights and future directions., 2024

work page 2024

[15] [15]

Nv- embed: Improved techniques for training llms as generalist embedding models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv- embed: Improved techniques for training llms as generalist embedding models. InThe Thirteenth International Confer- ence on Learning Representations, 2024

work page 2024

[16] [16]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020

[17] [17]

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, 2024

work page 2024

[18] [18]

Regionrag: Region-level retrieval- augmented generation for visual document understanding

Yinglu Li, Zhiying Lu, Zhihang Liu, Yiwei Sun, Chuanbin Liu, and Hongtao Xie. Regionrag: Region-level retrieval- augmented generation for visual document understanding. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 40, pages 6662–6670, 2026

work page 2026

[19] [19]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

work page 2004

[20] [20]

Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language mod- els.ACM Transactions on Information Systems, 43(2):1–32, 2025

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and En- hong Chen. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language mod- els.ACM Transactions on Information Systems, 43(2):1–32, 2025

work page 2025

[21] [21]

Unifying multimodal retrieval via document screenshot embedding

Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. InProceedings of the 2024 Confer- ence on Empirical Methods in Natural Language Processing, pages 6492–6505, 2024

work page 2024

[22] [22]

Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025

Quentin Mac ´e, Ant´onio Loison, and Manuel Faysse. Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025

work page arXiv 2025

[23] [23]

Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning. InFind- ings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022

work page 2022

[24] [24]

Infographicvqa

Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1697–1706, 2022

work page 2022

[25] [25]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceed- ings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

work page 2021

[26] [26]

Now Publishers Inc, 2009

Stephen Robertson and Hugo Zaragoza.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

work page 2009

[27] [27]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuo- fan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a com- prehensive dataset and benchmark for chain-of-thought rea- soning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

work page 2024

[28] [28]

An overview of the tesseract ocr engine

Ray Smith. An overview of the tesseract ocr engine. InNinth international conference on document analysis and recogni- tion (ICDAR 2007), volume 2, pages 629–633. IEEE, 2007

work page 2007

[29] [29]

Garage: A benchmark with grounding annotations for rag evaluation

Ionut Teodor Sorodoc, Leonardo FR Ribeiro, Rexhina Blloshmi, Christopher Davis, and Adri`a de Gispert. Garage: A benchmark with grounding annotations for rag evaluation. InFindings of the Association for Computational Linguis- tics: ACL 2025, pages 17030–17049, 2025

work page 2025

[30] [30]

Visdom: Multi-document qa with visually rich elements using multi- modal retrieval-augmented generation

Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A Rossi, and Dinesh Manocha. Visdom: Multi-document qa with visually rich elements using multi- modal retrieval-augmented generation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies...

work page 2025

[31] [31]

Vdocrag: Retrieval- augmented generation over visually-rich documents

Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. Vdocrag: Retrieval- augmented generation over visually-rich documents. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 24827–24837, 2025

work page 2025

[32] [32]

Slidevqa: A dataset for document visual question answering on multiple images

Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. Slidevqa: A dataset for document visual question answering on multiple images. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13636–13645, 2023

work page 2023

[33] [33]

Scan: Semantic document layout analysis for textual and visual retrieval- augmented generation

Nobuhiro Ueda, Yuyang Dong, Kriszti ´an Boros, Daiki Ito, Takuya Sera, and Masafumi Oyamada. Scan: Semantic document layout analysis for textual and visual retrieval- augmented generation. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1618–1637, 2026

work page 2026

[34] [34]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Vidorag: Visual document retrieval-augmented generation via dynamic iter- ative reasoning agents

Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shi- hang Wang, Pengjun Xie, and Feng Zhao. Vidorag: Visual document retrieval-augmented generation via dynamic iter- ative reasoning agents. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9124–9145, 2025

work page 2025

[36] [36]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Qwen2.5 technical report, 2025

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2025

work page 2025

[38] [38]

Crag-comprehensive rag bench- mark.Advances in Neural Information Processing Systems, 37:10470–10490, 2024

Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al. Crag-comprehensive rag bench- mark.Advances in Neural Information Processing Systems, 37:10470–10490, 2024

work page 2024

[39] [39]

Visrag: Vision-based retrieval-augmented genera- tion on multi-modality documents

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented genera- tion on multi-modality documents. InThe Thirteenth Inter- national Conference on Learning Representations, 2024

work page 2024

[40] [40]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

work page 2023

[41] [41]

relevant_bbox_ids

Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception.arXiv preprint arXiv:2410.12628, 2024. A. More Details on Datasets This section provides additional details on the construc- tion of the LFDocQA benchmark used in our experiments, i...

work page arXiv 2024

[42] [42]

These are high-value for RAG training

**Prioritize Relational Questions**: Strive to create questions that require synthesizing information from **multiple (two or three or four) bounding boxes** to answer. These are high-value for RAG training

work page

[43] [43]

**Allow High-Quality Single-Box Questions**: You may also include questions that can be answered from a single box, but only if the question requires summarization, inference, or understanding the text's purpose (not just simple text extraction)

work page

[44] [44]

Introduction

**Leverage Layout and Semantic Relationships**: Generate questions that require reasoning about the relationship between different content types based on their visual layout and semantic meaning. For example: * Ask how a figure is explained by its nearby caption. * Ask what the text under a specific title (e.g., "Introduction") describes. * Ask to summari...

work page

[45] [45]

Your output JSON must identify the relevant ID numbers (the integers) required to answer the question in the`relevant_bbox_ids`field

**Identify Evidence IDs**: Each box is labeled in the format`classname_ID`. Your output JSON must identify the relevant ID numbers (the integers) required to answer the question in the`relevant_bbox_ids`field

work page

[46] [46]

They must not contain the`classname_ID`labels (e.g., 'title_4', 'plain text_1')

**Do Not Expose Annotation Labels**: The`question`and`answer`strings you generate must be natural and sound as if a human is asking about the document. They must not contain the`classname_ID`labels (e.g., 'title_4', 'plain text_1'). These labels are only for your internal use to populate the`relevant_bbox_ids`list

work page

[47] [47]

How many tables are in this paper?

**Avoid General/Retrieval-Unfriendly Questions**: Do not ask overly broad questions about the document's structure or simple facts that are not suitable for a retrieval task. For example, avoid questions like "How many tables are in this paper?" or "What is the title of section 2?". Focus on the specific content and its meaning

work page

[48] [48]

Do not include any explanatory text before or after the JSON

**JSON Output Format**: Your response must be a strictly formatted JSON object. Do not include any explanatory text before or after the JSON

work page

[49] [49]

QAs": [ {

**JSON Structure**: The JSON object must contain a single key "QAs", which holds a list of objects. Each object in the list must contain three keys:`question`(string),`answer`(string), and`relevant_bbox_ids`(a list of integers). Example of the required output format: ```json { "QAs": [ { "question": "What are the affiliation and email address of the autho...

work page

[50] [50]

If the question is already self-contained and document-agnostic, keep it unchanged

work page

[51] [51]

the document

If the question is ambiguous or context-dependent (e.g., uses "the document", "the product"), use the relevant bboxes to find specific information (e.g., unique names, identifiers) and rewrite it into a standalone, document-independent question. *When rewriting, keep the new question as concise as possible while preserving the original meaning.*

work page

[52] [52]

question

If it cannot be rewritten into a clear standalone form, discard it (set question to null). - **Step 2: Output your final judgment in pure JSON.** ### Output Format: You must respond in pure JSON, with the following fields: { "question": "rewritten or original question text, or null if discarded", "reason_for_question": "Explain briefly why you kept, rewro...

work page

[53] [53]

For numbers, dates, and proper nouns: if the main conclusion is correct but formatting differs slightly (e.g., 09/01 vs September 1), be lenient

work page

[54] [54]

For long answers: if the core conclusion is correct but contains a small amount of noise, do not over-penalize

work page

[55] [55]

not mentioned/cannot be determined

If the reference answer indicates "not mentioned/cannot be determined", then fabricated specific facts should receive a low score (0-2). Return strict JSON only. Do not output anything else: {{"score": <integer 0-5>, "reason": "one short English reason"}} [Reference Answer] {ground_truth} [Model Answer] {prediction} Figure 10: Prompt for LLM-based QA Answ...

work page