LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
Pith reviewed 2026-05-25 00:26 UTC · model grok-4.3
The pith
LFRAG replaces page-level retrieval in multimodal RAG with block-level units created by layout segmentation and fused via cross-attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LFRAG advances multimodal RAG from page-level to block-level retrieval by first applying layout segmentation to form semantically coherent fine-grained units, then feeding those units to a semantic-layout fusion encoder that merges local semantics with global context through cross-attention, and finally performing block-level late interaction retrieval to achieve precise query-content alignment while discarding irrelevant material for generation.
What carries the argument
layout segmentation into semantically coherent blocks combined with a semantic-layout fusion encoder and block-level late interaction retrieval
If this is right
- LFRAG reaches state-of-the-art retrieval performance on LFDocQA.
- Answer accuracy rises 7.20 percent over the strongest baseline.
- Token consumption in generation drops 73.07 percent.
- Query-content alignment becomes more precise by excluding irrelevant blocks.
Where Pith is reading between the lines
- The same segmentation-plus-fusion pattern could be tested on non-question-answering tasks such as document summarization or classification.
- Dynamic adjustment of block boundaries according to query type might further reduce context size.
- LFDocQA supplies a reusable testbed for any future fine-grained multimodal retrieval method.
Load-bearing premise
Layout segmentation reliably yields semantically coherent blocks that improve alignment without losing critical context.
What would settle it
If page-level retrieval matches or exceeds LFRAG accuracy and token efficiency on LFDocQA or an equivalent block-annotated benchmark, the claimed advantage of the block-level approach would be refuted.
Figures
read the original abstract
Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimodal RAG systems predominantly rely on coarse-grained page-level retrieval, which fails to capture fine-grained semantic and layout structures in visually rich documents, thereby compromising retrieval accuracy and leading to redundant context in downstream tasks. To address these issues, we propose Layout-oriented Fine-grained Retrieval-Augmented Generation (LFRAG), a novel framework that advances multimodal RAG from page-level to block-level retrieval. We perform layout segmentation to construct semantically coherent fine-grained retrieval units and design a semantic-layout fusion encoder that integrates local semantics with global context via cross-attention. With block-level late interaction retrieval, LFRAG enables precise query-content alignment and reduces irrelevant content for downstream generation. To enable rigorous evaluation, we construct LFDocQA, a large-scale benchmark with block-level annotations spanning diverse document types, designed to assess both multimodal document retrieval and question answering with greater granularity than existing datasets. Extensive experiments on LFDocQA demonstrate that LFRAG achieves state-of-the-art performance on retrieval tasks, outperforms the best baseline by 7.20% in answer accuracy, and reduces token consumption by 73.07% in generation tasks, confirming LFRAG as an accurate and efficient framework for multimodal RAG over visually rich documents. Our code and datasets will be released soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LFRAG, a multimodal RAG framework that replaces page-level retrieval with block-level retrieval obtained via layout segmentation. It introduces a semantic-layout fusion encoder using cross-attention and block-level late interaction retrieval, constructs the LFDocQA benchmark with block-level annotations across document types, and reports SOTA retrieval performance, a 7.20% gain in answer accuracy over the best baseline, and a 73.07% reduction in token consumption for generation on LFDocQA.
Significance. If the reported gains prove robust under fair baseline adaptation, the shift to layout-aware block retrieval could improve precision and efficiency in multimodal RAG for visually rich documents; the new LFDocQA benchmark would also supply a useful testbed for fine-grained evaluation.
major comments (2)
- [§4, §4.1] §4 (Experiments) and §4.1 (Benchmark Construction): the headline deltas (+7.20% answer accuracy, -73.07% tokens) are measured exclusively on the self-constructed LFDocQA; the manuscript must specify the annotation protocol, inter-annotator agreement statistics, and whether page-level baselines were given equivalent block units or remained at page granularity, as this directly determines whether the gains demonstrate superiority of the layout-oriented approach rather than benchmark artifact.
- [§4.2] §4.2 (Baseline Comparison): without explicit confirmation that all baselines received the same block-level segmentation and retrieval units as LFRAG, the cross-method comparison on LFDocQA cannot isolate the contribution of the semantic-layout fusion encoder and late-interaction retrieval from the granularity change itself.
minor comments (2)
- The abstract states that code and datasets will be released soon; the final version should include a concrete release URL or repository link.
- [§3.2] Notation for the cross-attention module in the fusion encoder should be defined explicitly (e.g., query/key/value projections) rather than left to the figure caption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [§4, §4.1] §4 (Experiments) and §4.1 (Benchmark Construction): the headline deltas (+7.20% answer accuracy, -73.07% tokens) are measured exclusively on the self-constructed LFDocQA; the manuscript must specify the annotation protocol, inter-annotator agreement statistics, and whether page-level baselines were given equivalent block units or remained at page granularity, as this directly determines whether the gains demonstrate superiority of the layout-oriented approach rather than benchmark artifact.
Authors: We agree that additional details on benchmark construction are required for transparency. In the revised manuscript we will expand §4.1 with a full description of the annotation protocol (including segmentation guidelines, annotation criteria, and quality-control steps) and report inter-annotator agreement statistics. We will also explicitly confirm that page-level baselines were adapted to the identical block-level units produced by the layout segmentation, thereby isolating the contribution of LFRAG’s components from the granularity change. revision: yes
-
Referee: [§4.2] §4.2 (Baseline Comparison): without explicit confirmation that all baselines received the same block-level segmentation and retrieval units as LFRAG, the cross-method comparison on LFDocQA cannot isolate the contribution of the semantic-layout fusion encoder and late-interaction retrieval from the granularity change itself.
Authors: We will revise §4.2 to state explicitly that every baseline was supplied with the same block-level segmentation and retrieval units used by LFRAG. This clarification will demonstrate that observed gains arise from the semantic-layout fusion encoder and block-level late-interaction retrieval rather than from differences in retrieval granularity. revision: yes
Circularity Check
No significant circularity; claims rest on empirical evaluation of new framework and benchmark
full rationale
The provided abstract and description introduce LFRAG as a new framework using layout segmentation and a semantic-layout fusion encoder, plus the LFDocQA benchmark with block-level annotations. Performance claims (SOTA retrieval, +7.20% accuracy, -73.07% tokens) are presented as results of experiments on this benchmark. No equations, parameter fitting presented as prediction, self-citations as load-bearing premises, or derivations that reduce to inputs by construction appear. The chain is empirical comparison rather than tautological. This matches the default expectation for non-circular papers; the skeptic concern addresses benchmark fairness but does not constitute circularity under the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Layout segmentation produces semantically coherent fine-grained retrieval units
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.ArXiv, abs/2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Bench- marking large language models in retrieval-augmented gen- eration
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Bench- marking large language models in retrieval-augmented gen- eration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024
work page 2024
-
[4]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 4(5), 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Document haystacks: Vision-language reasoning over piles of 1000+ documents
Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, and Mo- hamed Elhoseiny. Document haystacks: Vision-language reasoning over piles of 1000+ documents. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24817–24826, 2025
work page 2025
-
[6]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropoli- tansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused sum- marization.arXiv preprint arXiv:2404.16130, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C ´eline Hudelot, and Pierre Colombo. Col- pali: Efficient document retrieval with vision language mod- els.arXiv preprint arXiv:2407.01449, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large lan- guage models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Retrieval augmented language model pre- training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre- training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020
work page 2020
-
[10]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022
work page 2022
-
[11]
A Survey on Retrieval-Augmented Text Generation for Large Language Models
Yizheng Huang and Jimmy Huang. A survey on retrieval- augmented text generation for large language models.arXiv preprint arXiv:2404.10981, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Yulong Hui, Yao Lu, and Huanchen Zhang. Uda: A bench- mark suite for retrieval augmented generation in real-world document analysis.Advances in Neural Information Pro- cessing Systems, 37:67200–67217, 2024
work page 2024
-
[13]
Colbert: Efficient and effective passage search via contextualized late interaction over bert
Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. InProceedings of the 43rd International ACM SI- GIR conference on research and development in Information Retrieval, pages 39–48, 2020
work page 2020
-
[14]
Building and better understanding vision- language models: insights and future directions., 2024
Hugo Laurenc ¸on, Andr´es Marafioti, Victor Sanh, and L ´eo Tronchon. Building and better understanding vision- language models: insights and future directions., 2024
work page 2024
-
[15]
Nv- embed: Improved techniques for training llms as generalist embedding models
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv- embed: Improved techniques for training llms as generalist embedding models. InThe Thirteenth International Confer- ence on Learning Representations, 2024
work page 2024
-
[16]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020
work page 2020
-
[17]
Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, 2024
work page 2024
-
[18]
Regionrag: Region-level retrieval- augmented generation for visual document understanding
Yinglu Li, Zhiying Lu, Zhihang Liu, Yiwei Sun, Chuanbin Liu, and Hongtao Xie. Regionrag: Region-level retrieval- augmented generation for visual document understanding. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 40, pages 6662–6670, 2026
work page 2026
-
[19]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004
work page 2004
-
[20]
Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and En- hong Chen. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language mod- els.ACM Transactions on Information Systems, 43(2):1–32, 2025
work page 2025
-
[21]
Unifying multimodal retrieval via document screenshot embedding
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. InProceedings of the 2024 Confer- ence on Empirical Methods in Natural Language Processing, pages 6492–6505, 2024
work page 2024
-
[22]
Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025
Quentin Mac ´e, Ant´onio Loison, and Manuel Faysse. Vidore benchmark v2: Raising the bar for visual retrieval.arXiv preprint arXiv:2505.17166, 2025
-
[23]
Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning
Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning. InFind- ings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022
work page 2022
-
[24]
Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1697–1706, 2022
work page 2022
-
[25]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceed- ings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021
work page 2021
-
[26]
Stephen Robertson and Hugo Zaragoza.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009
work page 2009
-
[27]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuo- fan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a com- prehensive dataset and benchmark for chain-of-thought rea- soning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024
work page 2024
-
[28]
An overview of the tesseract ocr engine
Ray Smith. An overview of the tesseract ocr engine. InNinth international conference on document analysis and recogni- tion (ICDAR 2007), volume 2, pages 629–633. IEEE, 2007
work page 2007
-
[29]
Garage: A benchmark with grounding annotations for rag evaluation
Ionut Teodor Sorodoc, Leonardo FR Ribeiro, Rexhina Blloshmi, Christopher Davis, and Adri`a de Gispert. Garage: A benchmark with grounding annotations for rag evaluation. InFindings of the Association for Computational Linguis- tics: ACL 2025, pages 17030–17049, 2025
work page 2025
-
[30]
Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A Rossi, and Dinesh Manocha. Visdom: Multi-document qa with visually rich elements using multi- modal retrieval-augmented generation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies...
work page 2025
-
[31]
Vdocrag: Retrieval- augmented generation over visually-rich documents
Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. Vdocrag: Retrieval- augmented generation over visually-rich documents. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 24827–24837, 2025
work page 2025
-
[32]
Slidevqa: A dataset for document visual question answering on multiple images
Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. Slidevqa: A dataset for document visual question answering on multiple images. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13636–13645, 2023
work page 2023
-
[33]
Scan: Semantic document layout analysis for textual and visual retrieval- augmented generation
Nobuhiro Ueda, Yuyang Dong, Kriszti ´an Boros, Daiki Ito, Takuya Sera, and Masafumi Oyamada. Scan: Semantic document layout analysis for textual and visual retrieval- augmented generation. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1618–1637, 2026
work page 2026
-
[34]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Vidorag: Visual document retrieval-augmented generation via dynamic iter- ative reasoning agents
Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shi- hang Wang, Pengjun Xie, and Feng Zhao. Vidorag: Visual document retrieval-augmented generation via dynamic iter- ative reasoning agents. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9124–9145, 2025
work page 2025
-
[36]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Qwen2.5 technical report, 2025
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2025
work page 2025
-
[38]
Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al. Crag-comprehensive rag bench- mark.Advances in Neural Information Processing Systems, 37:10470–10490, 2024
work page 2024
-
[39]
Visrag: Vision-based retrieval-augmented genera- tion on multi-modality documents
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented genera- tion on multi-modality documents. InThe Thirteenth Inter- national Conference on Learning Representations, 2024
work page 2024
-
[40]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023
work page 2023
-
[41]
Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception.arXiv preprint arXiv:2410.12628, 2024. A. More Details on Datasets This section provides additional details on the construc- tion of the LFDocQA benchmark used in our experiments, i...
-
[42]
These are high-value for RAG training
**Prioritize Relational Questions**: Strive to create questions that require synthesizing information from **multiple (two or three or four) bounding boxes** to answer. These are high-value for RAG training
-
[43]
**Allow High-Quality Single-Box Questions**: You may also include questions that can be answered from a single box, but only if the question requires summarization, inference, or understanding the text's purpose (not just simple text extraction)
-
[44]
**Leverage Layout and Semantic Relationships**: Generate questions that require reasoning about the relationship between different content types based on their visual layout and semantic meaning. For example: * Ask how a figure is explained by its nearby caption. * Ask what the text under a specific title (e.g., "Introduction") describes. * Ask to summari...
-
[45]
**Identify Evidence IDs**: Each box is labeled in the format`classname_ID`. Your output JSON must identify the relevant ID numbers (the integers) required to answer the question in the`relevant_bbox_ids`field
-
[46]
They must not contain the`classname_ID`labels (e.g., 'title_4', 'plain text_1')
**Do Not Expose Annotation Labels**: The`question`and`answer`strings you generate must be natural and sound as if a human is asking about the document. They must not contain the`classname_ID`labels (e.g., 'title_4', 'plain text_1'). These labels are only for your internal use to populate the`relevant_bbox_ids`list
-
[47]
How many tables are in this paper?
**Avoid General/Retrieval-Unfriendly Questions**: Do not ask overly broad questions about the document's structure or simple facts that are not suitable for a retrieval task. For example, avoid questions like "How many tables are in this paper?" or "What is the title of section 2?". Focus on the specific content and its meaning
-
[48]
Do not include any explanatory text before or after the JSON
**JSON Output Format**: Your response must be a strictly formatted JSON object. Do not include any explanatory text before or after the JSON
-
[49]
**JSON Structure**: The JSON object must contain a single key "QAs", which holds a list of objects. Each object in the list must contain three keys:`question`(string),`answer`(string), and`relevant_bbox_ids`(a list of integers). Example of the required output format: ```json { "QAs": [ { "question": "What are the affiliation and email address of the autho...
-
[50]
If the question is already self-contained and document-agnostic, keep it unchanged
-
[51]
If the question is ambiguous or context-dependent (e.g., uses "the document", "the product"), use the relevant bboxes to find specific information (e.g., unique names, identifiers) and rewrite it into a standalone, document-independent question. *When rewriting, keep the new question as concise as possible while preserving the original meaning.*
-
[52]
If it cannot be rewritten into a clear standalone form, discard it (set question to null). - **Step 2: Output your final judgment in pure JSON.** ### Output Format: You must respond in pure JSON, with the following fields: { "question": "rewritten or original question text, or null if discarded", "reason_for_question": "Explain briefly why you kept, rewro...
-
[53]
For numbers, dates, and proper nouns: if the main conclusion is correct but formatting differs slightly (e.g., 09/01 vs September 1), be lenient
-
[54]
For long answers: if the core conclusion is correct but contains a small amount of noise, do not over-penalize
-
[55]
not mentioned/cannot be determined
If the reference answer indicates "not mentioned/cannot be determined", then fabricated specific facts should receive a low score (0-2). Return strict JSON only. Do not output anything else: {{"score": <integer 0-5>, "reason": "one short English reason"}} [Reference Answer] {ground_truth} [Model Answer] {prediction} Figure 10: Prompt for LLM-based QA Answ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.