From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Alexandre Sousa; Br\'igida M\'onica Faria; Henrique Lopes Cardoso; Jos\'e Duarte; Jos\'e Guilherme Marques dos Santos; Jos\'e Lu\'is Reis; Jos\'e Paulo Marques dos Santos; Lu\'is Paulo Reis; Pedro Pimenta; Ricardo Yang

arxiv: 2604.04948 · v2 · pith:B7LAWZNLnew · submitted 2026-03-30 · 💻 cs.IR · cs.AI· cs.LG

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Jos\'e Guilherme Marques dos Santos , Ricardo Yang , Rui Humberto Pereira , Alexandre Sousa , Br\'igida M\'onica Faria , Henrique Lopes Cardoso , Jos\'e Duarte , Jos\'e Lu\'is Reis

show 3 more authors

Lu\'is Paulo Reis Pedro Pimenta Jos\'e Paulo Marques dos Santos

This is my paper

Pith reviewed 2026-05-14 02:01 UTC · model grok-4.3

classification 💻 cs.IR cs.AIcs.LG

keywords RAGPDF conversiondocument preprocessingquestion answeringmetadata enrichmenthierarchy-aware chunkingGraphRAGLLM evaluation

0 comments

The pith

Metadata enrichment and hierarchy-aware chunking improve RAG accuracy more than the choice of PDF conversion framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper systematically tests four PDF-to-Markdown tools across 19 pipeline variants on 36 Portuguese administrative documents and a 50-question benchmark. It finds that adding metadata and using hierarchy-aware chunking lifts accuracy to 94.1 percent, well above the 86.9 percent baseline from a naive loader and close to the 97.1 percent from manually curated Markdown. The conversion tool itself matters less than these preprocessing choices. Font-based hierarchy detection beats LLM-based methods, while a basic GraphRAG setup scores only 82 percent.

Core claim

The central claim is that data preparation quality is the dominant factor in RAG performance for domain-specific question answering, with metadata enrichment and hierarchy-aware chunking contributing more to accuracy than the specific PDF conversion framework, as shown by Docling with hierarchical splitting reaching 94.1 percent versus lower scores for other tool combinations.

What carries the argument

Hierarchy-aware chunking paired with metadata enrichment, which uses font information to rebuild document structure and produce better retrieval units.

If this is right

Font-based hierarchy rebuilding outperforms LLM-based structure detection.
Metadata enrichment and hierarchical splitting raise accuracy substantially over basic loaders.
Naive GraphRAG without ontological guidance underperforms standard RAG.
Manual curation sets an upper bound at 97.1 percent, leaving room for automated gains.
Including image descriptions during conversion aids performance on documents with visuals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar preprocessing emphasis could improve RAG results in other languages or document types.
Teams should allocate more effort to chunking and enrichment than to switching conversion tools.
Ontology-guided graph construction may be needed to make GraphRAG competitive.
Expanding the benchmark or adding human raters would test the stability of the LLM-judge results.

Load-bearing premise

That LLM-as-judge scores on 50 questions reliably measure true downstream question-answering quality without human validation or error bars.

What would settle it

A side-by-side human evaluation of answers from the best automated pipeline and the naive baseline on the same 50 questions.

Figures

Figures reproduced from arXiv: 2604.04948 by Alexandre Sousa, Br\'igida M\'onica Faria, Henrique Lopes Cardoso, Jos\'e Duarte, Jos\'e Guilherme Marques dos Santos, Jos\'e Lu\'is Reis, Jos\'e Paulo Marques dos Santos, Lu\'is Paulo Reis, Pedro Pimenta, Ricardo Yang, Rui Humberto Pereira.

**Figure 1.** Figure 1: ETL pipeline workflow. Raw PDFs from the Bronze layer are extracted and transformed into intermediate Markdown (Silver layer), then cleaned and finalized into RAG-ready Markdown with extracted assets (Gold layer). The pipeline supports several configurable transformation options designed to address known issues in framework outputs: HTML table cleaning (converting HTML tables to Markdown tables), LaTeX fo… view at source ↗

**Figure 3.** Figure 3: Knowledge graph data model. TextChunk nodes store text content with source metadata and embedding vectors. Entity nodes store a unique identifier, name, and semantic type. MENTIONS relationships link text chunks to the entities extracted from them. RELATED relationships capture semantic connections between entities. A semantic deduplication pipeline was subsequently applied to address entity duplication … view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 21 pipeline configurations, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines bounded the results: na\"ive PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 +/- 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This gives practical numbers on PDF tools for RAG but the LLM-judge scores on 50 questions lack validation so the rankings and relative-impact claims are not yet solid.

read the letter

The paper runs a head-to-head test of four PDF-to-Markdown converters on 36 Portuguese administrative PDFs, measuring downstream RAG QA accuracy across 19 configurations. Docling with hierarchical splitting and image descriptions reached 94.1 percent, beating the naive baseline at 86.9 percent and sitting close to the manual gold at 97.1 percent. They also tried an exploratory GraphRAG setup that scored lower at 82 percent. That is the concrete new data point: no earlier work had tied converter choice directly to QA accuracy on this kind of corpus with this many pipeline variants.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, DeepSeek OCR) across 19 pipeline configurations varying conversion tool, cleaning, splitting strategy, and metadata enrichment. On a manually curated 50-question benchmark over 36 Portuguese administrative documents (1,706 pages), LLM-as-judge scoring averaged over 10 runs shows Docling with hierarchical splitting and image descriptions reaching 94.1% accuracy, above a naive PDFLoader baseline (86.9%) but below manually curated Markdown (97.1%). The authors conclude that metadata enrichment and hierarchy-aware chunking contribute more to accuracy than conversion framework choice alone, while an exploratory GraphRAG scores only 82%.

Significance. If the accuracy attributions hold, the work supplies a useful empirical benchmark for RAG preprocessing in domain-specific settings, underscoring that data-preparation choices dominate over tool selection. Strengths include explicit baselines, multiple runs, a held-out question set, and a manually curated gold standard that ground the comparisons.

major comments (2)

[Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.
[Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.

minor comments (1)

[Abstract] Abstract: 'naïve' is rendered with an escaped quote; standardize to 'naive' for readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our evaluation methodology. We address the major points below and will incorporate statistical enhancements in the revision to improve rigor.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.

Authors: We acknowledge that the attribution of greater impact to metadata enrichment and hierarchy-aware chunking rests on patterns observed across the 19 configurations using LLM-as-judge scores. These patterns emerge from controlled variations where hierarchy and metadata were toggled independently of the conversion tool, with consistent gains (e.g., hierarchical splitting outperforming naive chunking across Docling, MinerU, and Marker). To address the concern, we will add per-configuration standard deviations as error bars, report per-question variance in an appendix, include confidence intervals, and apply paired statistical tests (e.g., t-tests with multiple-comparison correction) to the key differences. We will also expand the limitations section to note that LLM-as-judge may introduce format biases and that human correlation studies remain valuable future work. revision: partial
Referee: [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.

Authors: We agree that the absence of error bars and formal statistical tests limits interpretability of the 10-run averages. In the revised manuscript we will add mean ± standard deviation error bars to all tables and figures, and include appropriate tests (ANOVA followed by post-hoc pairwise comparisons with correction) to establish which differences between the 19 pipelines are statistically significant. This will clarify whether gaps such as the 94.1% vs. 86.9% baseline are robust to judge variability. revision: yes

standing simulated objections not resolved

A full human evaluation study with inter-annotator agreement to correlate against LLM-as-judge scores, which exceeds the scope and resources of the current revision.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external baselines

full rationale

The paper conducts a controlled empirical comparison of four PDF-to-Markdown tools across 19 configurations on a fixed 50-question set, reporting LLM-as-judge accuracies against two external baselines (naïve PDFLoader at 86.9% and manual Markdown at 97.1%). No derivations, equations, or first-principles results exist that could reduce to fitted parameters or self-referential definitions. Conclusions about metadata enrichment and hierarchy-aware chunking are drawn directly from observed accuracy deltas, not from any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The methodology is self-contained against the held-out question set and does not rename known results or import load-bearing premises from the authors' own citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the standard assumption that LLM-as-judge scores correlate with human judgments of answer quality and that the 50-question set is representative of real user needs in the domain.

axioms (1)

domain assumption LLM-as-judge produces reliable accuracy estimates for RAG outputs
Used to score all 19 configurations and baselines without reported human validation

pith-pipeline@v0.9.0 · 5599 in / 1172 out tokens · 58659 ms · 2026-05-14T02:01:10.447492+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.