pith. machine review for the scientific record.

arxiv: 2604.12047 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.IR

Recognition: unknown

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords PDF parsing · chunking strategies · RAG · financial question answering · document structure · TableQuest · empirical study · information extraction

The pith

PDF parsers and chunking strategies significantly affect the performance of RAG systems for financial question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper empirically evaluates how different PDF parsers and chunking strategies influence retrieval-augmented generation for answering questions drawn from financial documents. It systematically tests multiple parsers alongside chunking methods that vary in overlap, using two benchmarks from the financial domain and introducing a new table-focused benchmark called TableQuest. The study measures how well these choices preserve document structure and lead to correct answers. Readers might care because PDFs with tables and heterogeneous content are common in finance, yet difficult for automated systems to process accurately. The results are intended to supply concrete guidelines for designing effective RAG pipelines.
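The pipeline under study retrieves the document pages most similar to a question and feeds them to an LLM. As a toy illustration of that retrieval step only (a plain cosine-similarity ranker over page embeddings; the paper itself compares several retrievers and indexes, none of which this sketch reproduces):

```python
import math

def top_k_pages(query_vec, page_vecs, k=3):
    """Rank pages by cosine similarity to a query embedding and return the
    indices of the top-k pages, mimicking the retrieval stage of a RAG pipeline."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    # Sort page indices by descending similarity to the query.
    order = sorted(range(len(page_vecs)),
                   key=lambda i: cos(query_vec, page_vecs[i]), reverse=True)
    return order[:k]
```

The parser determines what text each page vector is built from, and the chunker determines its granularity, which is why the two choices interact downstream.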

Core claim

The central finding is that the choice of PDF parser and chunking strategy, including the degree of chunk overlap, produces measurable differences both in how well the structure of financial documents is preserved and in the correctness of the answers a RAG system generates. Evaluated across the benchmarks, these differences yield practical guidelines for building robust PDF-understanding pipelines.

What carries the argument

The key machinery is a systematic grid of PDF parsers crossed with chunking strategies at different overlap settings, with each combination assessed on how well it preserves document structure and supports answer accuracy in RAG question answering on the financial benchmarks.
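For concreteness, here is a minimal character-level sliding-window chunker with configurable overlap (an illustrative sketch; the paper evaluates several chunking strategies whose exact units and parameters are not reproduced here):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split parsed text into fixed-size chunks with a configurable overlap.

    A larger overlap repeats more context across chunk boundaries, which can
    keep a table row near its header in the same chunk, at the cost of a
    larger index and more redundant retrieval candidates.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Whether overlap helps depends on what the parser emits: a parser that flattens a table into one long line benefits differently from overlap than one that preserves row breaks.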

If this is right

  • Optimal parser-chunker pairs improve handling of tables and mixed content in PDFs for better RAG results.
  • Chunk overlap levels interact with parser choice to affect performance.
  • The new TableQuest benchmark provides a targeted way to evaluate table understanding in financial QA.
  • These empirical findings directly inform the selection of components when constructing RAG systems for PDF documents.
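The study's design amounts to a grid search over parser × chunker pairs under a fixed downstream evaluator. A schematic of that loop (function names and signatures are illustrative, not the paper's code):

```python
from itertools import product
from statistics import mean

def evaluate_grid(docs, parsers, chunkers, score_fn):
    """Score every parser x chunker pair with the same downstream evaluator.

    docs:     raw documents (here, stand-ins for parsed PDFs)
    parsers:  name -> callable(raw_doc) -> extracted text
    chunkers: name -> callable(text) -> list of chunks
    score_fn: callable(chunks) -> float, e.g. QA accuracy over a benchmark
    Returns {(parser_name, chunker_name): mean score over docs}.
    """
    results = {}
    for (p_name, parse), (c_name, chunk) in product(parsers.items(), chunkers.items()):
        results[(p_name, c_name)] = mean(score_fn(chunk(parse(d))) for d in docs)
    return results
```

Holding the benchmark and evaluator fixed while varying only the parser and chunker is what lets the study attribute performance differences to those two components.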

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar evaluation methods could be applied to PDFs in legal or scientific domains to derive domain-specific guidelines.
  • Future experiments might combine these parsing and chunking approaches with advanced layout analysis techniques for further gains.
  • Results may vary with different base language models or RAG architectures, suggesting the need for model-specific tuning.

Load-bearing premise

The assumption that the selected financial benchmarks and metrics capture the essential real-world challenges of PDF understanding and that findings generalize beyond the tested configurations.

What would settle it

A demonstration that an untested PDF parser or chunking method achieves higher accuracy on the same benchmarks, or that the recommended strategies underperform on a new financial dataset with different table structures, would challenge the guidelines.

Figures

Figures reproduced from arXiv:2604.12047 by Anas Zilali, Anne Goujon, Jacques Klein, Omar El Bachyr, Saad Ezzini, Tegawendé F. Bissyandé, Ulrick Ble, Yewei Song.

Figure 1. An overview of the Retrieval-Augmented Generation …
Figure 2. TableQuest Dataset Construction Process.
Figure 3. End-to-end PDF-QA pipeline. … TableQuest) is encoded and queried against the selected index to retrieve the top-K relevant pages. Those pages, together with the original question, form the prompt for one of several tested LLMs (see Table 12), which generates an answer. In the evaluation stage, we (a) assess retrieval quality by measuring the pipeline's ability to return the correct pages for each query, and …
Figure 4. Parser stability in page-level MRR (Mean Reciprocal Rank) …
Figure 5. Heatmaps of page-level retrieval performance …
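Figure 4 reports parser stability in page-level MRR. For reference, the standard Mean Reciprocal Rank computation over ranked page lists (a textbook formulation, assumed here rather than taken from the paper's code):

```python
def mean_reciprocal_rank(ranked_pages, gold_pages):
    """Page-level MRR: for each query, score 1/rank of the first retrieved
    page that is a gold (answer-bearing) page, or 0 if none is retrieved,
    then average over queries."""
    total = 0.0
    for retrieved, gold in zip(ranked_pages, gold_pages):
        for rank, page in enumerate(retrieved, start=1):
            if page in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_pages)
```

A parser is "stable" in this sense if its MRR changes little as the chunking strategy and overlap vary.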
Original abstract

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems to automated PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper performs an empirical study on the impact of PDF parsers and chunking strategies (including different overlap settings) on the performance of RAG-based question answering systems. It uses two financial domain benchmarks, including the newly proposed TableQuest dataset, to evaluate how these choices affect the preservation of document structure and the correctness of answers. The authors conclude that their results provide practical guidelines for developing robust RAG pipelines for PDF understanding.

Significance. If the experimental results are reliable, this study contributes empirical evidence on component interactions in RAG systems for handling complex PDF content like tables and text. The public release of TableQuest is a notable strength, as it provides a new resource for financial QA research. The work could inform practitioners in the financial sector on optimizing their PDF processing pipelines. However, the domain-specific nature of the benchmarks may constrain the generalizability of the proposed guidelines to broader PDF understanding tasks.

major comments (1)
  1. [Abstract] The central claim that the study offers 'practical guidelines for building robust RAG pipelines for PDF understanding' is not adequately supported by the experimental design. The evaluation is confined to financial-domain QA benchmarks, without testing on other document types (e.g., scientific papers with equations or multi-column layouts). This raises the risk that the identified synergies between parsers and chunking methods are artifacts of financial report formatting rather than general PDF properties, undermining the broader applicability asserted in the abstract and conclusions.
minor comments (1)
  1. Consider adding a summary table of the key parser-chunking combinations and their performance metrics across the benchmarks to improve the clarity of the results presentation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and outline the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the study offers 'practical guidelines for building robust RAG pipelines for PDF understanding' is not adequately supported by the experimental design. The evaluation is confined to financial-domain QA benchmarks, without testing on other document types (e.g., scientific papers with equations or multi-column layouts). This raises the risk that the identified synergies between parsers and chunking methods are artifacts of financial report formatting rather than general PDF properties, undermining the broader applicability asserted in the abstract and conclusions.

    Authors: We agree that the experimental scope is limited to financial-domain benchmarks and that this constrains claims of broad applicability across all PDF types. Financial reports feature dense tabular structures and specific layouts that may not generalize to scientific papers with equations or other multi-column formats, so the observed parser-chunking synergies could partly reflect domain-specific formatting. To correct this, we will revise the abstract to state that the guidelines apply to financial PDF understanding, update the conclusions accordingly, and add an explicit limitations paragraph discussing domain specificity and the need for future cross-domain validation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

Full rationale

The paper is a purely empirical evaluation of PDF parsers, chunking strategies, and their synergies on two financial-domain QA benchmarks (including a new TableQuest dataset). It reports experimental results to derive practical guidelines for RAG pipelines. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methodology. The guidelines are presented as direct outcomes of the systematic comparisons performed, with no reduction of claims to their own inputs by construction. This is a standard empirical study whose central claims rest on observed performance metrics rather than any internal circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As an empirical evaluation paper, the central claim rests on assumptions about benchmark representativeness and metric validity rather than mathematical derivations or new entities.

axioms (1)
  • Domain assumption: The selected financial benchmarks and metrics are representative proxies for PDF understanding performance in QA tasks.
    The study relies on TableQuest and the second benchmark being suitable for drawing general guidelines.

pith-pipeline@v0.9.0 · 5501 in / 1107 out tokens · 80708 ms · 2026-05-10T15:13:10.243774+00:00 · methodology

