pith. machine review for the scientific record.

arxiv: 2604.12047 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.IR

Recognition: unknown

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords PDF parsing · chunking strategies · RAG · financial question answering · document structure · TableQuest · empirical study · information extraction

The pith

PDF parsers and chunking strategies significantly affect the performance of RAG systems for financial question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper empirically evaluates how different PDF parsers and chunking strategies influence retrieval-augmented generation for answering questions drawn from financial documents. It systematically tests multiple parsers alongside chunking methods that vary in overlap, using two benchmarks from the financial domain and introducing a new table-focused benchmark called TableQuest. The study measures how well these choices preserve document structure and lead to correct answers. Readers might care because PDFs with tables and heterogeneous content are common in finance, yet difficult for automated systems to process accurately. The results are intended to supply concrete guidelines for designing effective RAG pipelines.
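The pipeline under study retrieves the document pages most similar to a question and feeds them to an LLM. As a toy illustration of that retrieval step only (a plain cosine-similarity ranker over page embeddings; the paper itself compares several retrievers and indexes, none of which this sketch reproduces):

```python
import math

def top_k_pages(query_vec, page_vecs, k=3):
    """Rank pages by cosine similarity to a query embedding and return the
    indices of the top-k pages, mimicking the retrieval stage of a RAG pipeline."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    # Sort page indices by descending similarity to the query.
    order = sorted(range(len(page_vecs)),
                   key=lambda i: cos(query_vec, page_vecs[i]), reverse=True)
    return order[:k]
```

The parser determines what text each page vector is built from, and the chunker determines its granularity, which is why the two choices interact downstream.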

Core claim

The central finding is that the choice of PDF parser and chunking strategy, including the degree of chunk overlap, produces measurable differences both in how well the structure of financial documents is preserved and in the correctness of the answers a RAG system generates. Evaluated across the benchmarks, these differences yield practical guidelines for building robust PDF-understanding pipelines.

What carries the argument

The key machinery is a systematic grid of PDF parsers crossed with chunking strategies at different overlap settings, with each combination assessed on how well it preserves document structure and supports answer accuracy in RAG question answering on the financial benchmarks.
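For concreteness, here is a minimal character-level sliding-window chunker with configurable overlap (an illustrative sketch; the paper evaluates several chunking strategies whose exact units and parameters are not reproduced here):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split parsed text into fixed-size chunks with a configurable overlap.

    A larger overlap repeats more context across chunk boundaries, which can
    keep a table row near its header in the same chunk, at the cost of a
    larger index and more redundant retrieval candidates.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Whether overlap helps depends on what the parser emits: a parser that flattens a table into one long line benefits differently from overlap than one that preserves row breaks.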

If this is right

  • Optimal parser-chunker pairs improve handling of tables and mixed content in PDFs for better RAG results.
  • Chunk overlap levels interact with parser choice to affect performance.
  • The new TableQuest benchmark provides a targeted way to evaluate table understanding in financial QA.
  • These empirical findings directly inform the selection of components when constructing RAG systems for PDF documents.
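The study's design amounts to a grid search over parser × chunker pairs under a fixed downstream evaluator. A schematic of that loop (function names and signatures are illustrative, not the paper's code):

```python
from itertools import product
from statistics import mean

def evaluate_grid(docs, parsers, chunkers, score_fn):
    """Score every parser x chunker pair with the same downstream evaluator.

    docs:     raw documents (here, stand-ins for parsed PDFs)
    parsers:  name -> callable(raw_doc) -> extracted text
    chunkers: name -> callable(text) -> list of chunks
    score_fn: callable(chunks) -> float, e.g. QA accuracy over a benchmark
    Returns {(parser_name, chunker_name): mean score over docs}.
    """
    results = {}
    for (p_name, parse), (c_name, chunk) in product(parsers.items(), chunkers.items()):
        results[(p_name, c_name)] = mean(score_fn(chunk(parse(d))) for d in docs)
    return results
```

Holding the benchmark and evaluator fixed while varying only the parser and chunker is what lets the study attribute performance differences to those two components.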

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar evaluation methods could be applied to PDFs in legal or scientific domains to derive domain-specific guidelines.
  • Future experiments might combine these parsing and chunking approaches with advanced layout analysis techniques for further gains.
  • Results may vary with different base language models or RAG architectures, suggesting the need for model-specific tuning.

Load-bearing premise

The assumption that the selected financial benchmarks and metrics capture the essential real-world challenges of PDF understanding and that findings generalize beyond the tested configurations.

What would settle it

A demonstration that an untested PDF parser or chunking method achieves higher accuracy on the same benchmarks, or that the recommended strategies underperform on a new financial dataset with different table structures, would challenge the guidelines.

Figures

Figures reproduced from arXiv:2604.12047 by Anas Zilali, Anne Goujon, Jacques Klein, Omar El Bachyr, Saad Ezzini, Tegawendé F. Bissyandé, Ulrick Ble, Yewei Song.

Figure 1. An overview of the Retrieval-Augmented Generation …
Figure 2. TableQuest Dataset Construction Process.
Figure 3. End-to-end PDF-QA pipeline. … TableQuest) is encoded and queried against the selected index to retrieve the top-K relevant pages. Those pages, together with the original question, form the prompt for one of several tested LLMs (see Table 12), which generates an answer. In the evaluation stage, we (a) assess retrieval quality by measuring the pipeline's ability to return the correct pages for each query, and …
Figure 4. Parser stability in page-level MRR (Mean Reciprocal Rank) …
Figure 5. Heatmaps of page-level retrieval performance …
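Figure 4 reports parser stability in page-level MRR. For reference, the standard Mean Reciprocal Rank computation over ranked page lists (a textbook formulation, assumed here rather than taken from the paper's code):

```python
def mean_reciprocal_rank(ranked_pages, gold_pages):
    """Page-level MRR: for each query, score 1/rank of the first retrieved
    page that is a gold (answer-bearing) page, or 0 if none is retrieved,
    then average over queries."""
    total = 0.0
    for retrieved, gold in zip(ranked_pages, gold_pages):
        for rank, page in enumerate(retrieved, start=1):
            if page in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_pages)
```

A parser is "stable" in this sense if its MRR changes little as the chunking strategy and overlap vary.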
Original abstract

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems to automated PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper performs an empirical study on the impact of PDF parsers and chunking strategies (including different overlap settings) on the performance of RAG-based question answering systems. It uses two financial domain benchmarks, including the newly proposed TableQuest dataset, to evaluate how these choices affect the preservation of document structure and the correctness of answers. The authors conclude that their results provide practical guidelines for developing robust RAG pipelines for PDF understanding.

Significance. If the experimental results are reliable, this study contributes empirical evidence on component interactions in RAG systems for handling complex PDF content like tables and text. The public release of TableQuest is a notable strength, as it provides a new resource for financial QA research. The work could inform practitioners in the financial sector on optimizing their PDF processing pipelines. However, the domain-specific nature of the benchmarks may constrain the generalizability of the proposed guidelines to broader PDF understanding tasks.

major comments (1)
  1. [Abstract] The central claim that the study offers 'practical guidelines for building robust RAG pipelines for PDF understanding' is not adequately supported by the experimental design. The evaluation is confined to financial-domain QA benchmarks, without testing on other document types (e.g., scientific papers with equations or multi-column layouts). This raises the risk that the identified synergies between parsers and chunking methods are artifacts of financial report formatting rather than general PDF properties, undermining the broader applicability asserted in the abstract and conclusions.
minor comments (1)
  1. Consider adding a summary table of the key parser-chunking combinations and their performance metrics across the benchmarks to improve the clarity of the results presentation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and outline the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the study offers 'practical guidelines for building robust RAG pipelines for PDF understanding' is not adequately supported by the experimental design. The evaluation is confined to financial-domain QA benchmarks, without testing on other document types (e.g., scientific papers with equations or multi-column layouts). This raises the risk that the identified synergies between parsers and chunking methods are artifacts of financial report formatting rather than general PDF properties, undermining the broader applicability asserted in the abstract and conclusions.

    Authors: We agree that the experimental scope is limited to financial-domain benchmarks and that this constrains claims of broad applicability across all PDF types. Financial reports feature dense tabular structures and specific layouts that may not generalize to scientific papers with equations or other multi-column formats, so the observed parser-chunking synergies could partly reflect domain-specific formatting. To correct this, we will revise the abstract to state that the guidelines apply to financial PDF understanding, update the conclusions accordingly, and add an explicit limitations paragraph discussing domain specificity and the need for future cross-domain validation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

Full rationale

The paper is a purely empirical evaluation of PDF parsers, chunking strategies, and their synergies on two financial-domain QA benchmarks (including a new TableQuest dataset). It reports experimental results to derive practical guidelines for RAG pipelines. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methodology. The guidelines are presented as direct outcomes of the systematic comparisons performed, with no reduction of claims to their own inputs by construction. This is a standard empirical study whose central claims rest on observed performance metrics rather than any internal circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As an empirical evaluation paper, the central claim rests on assumptions about benchmark representativeness and metric validity rather than mathematical derivations or new entities.

axioms (1)
  • Domain assumption: The selected financial benchmarks and metrics are representative proxies for PDF understanding performance in QA tasks.
    The study relies on TableQuest and the second benchmark being suitable for drawing general guidelines.

pith-pipeline@v0.9.0 · 5501 in / 1107 out tokens · 80708 ms · 2026-05-10T15:13:10.243774+00:00 · methodology

