Pith Number

pith:B7LAWZNL

pith:2026:B7LAWZNLDUKNYWFKMMPRTW5SII
not attested · not anchored · not stored · refs pending

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Guilherme Marques dos Santos, José Luís Reis, José Paulo Marques dos Santos, Luís Paulo Reis, Pedro Pimenta, Ricardo Yang, Rui Humberto Pereira

Metadata enrichment and hierarchy-aware chunking improve RAG accuracy more than the choice of PDF conversion framework.

arxiv:2604.04948 v2 · 2026-03-30 · cs.IR · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{B7LAWZNLDUKNYWFKMMPRTW5SII}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone.

C2 · weakest assumption

That LLM-as-judge scoring on 50 questions reliably measures true downstream question-answering quality without human validation or error bars on the judge itself.

C3 · one line summary

Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-27T01:04:57.708295Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

0fd60b65ab1d14dc58aa631f19dbb2421849346328f92888ecdfb939b4b1f9fa

Aliases

arxiv: 2604.04948 · arxiv_version: 2604.04948v2 · doi: 10.48550/arxiv.2604.04948 · pith_short_12: B7LAWZNLDUKN · pith_short_16: B7LAWZNLDUKNYWFK · pith_short_8: B7LAWZNL
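The identifiers on this page appear to be related in a simple way: Base32-encoding the first 16 bytes of the canonical hash (RFC 4648 alphabet, padding dropped) reproduces the 26-character Pith Number, and the short aliases are its 8, 12, and 16 character prefixes. This is an observation from the values shown here, not a documented rule of the pith-number/v1.0 schema. The sketch below checks it; `pith_from_hash` and `short_alias` are hypothetical helper names.

```python
import base64

# Values copied from this record page.
CANONICAL_HASH = "0fd60b65ab1d14dc58aa631f19dbb2421849346328f92888ecdfb939b4b1f9fa"
PITH_NUMBER = "B7LAWZNLDUKNYWFKMMPRTW5SII"


def pith_from_hash(hex_digest: str) -> str:
    """Hypothetical helper: Base32-encode the first 16 digest bytes, drop '=' padding.

    Assumption: this mirrors how the identifier is derived; it is inferred
    from the values on this page, not taken from the schema.
    """
    head = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(head).decode("ascii").rstrip("=")


def short_alias(pith_number: str, length: int) -> str:
    """Hypothetical helper: a short alias is a fixed-length prefix of the full ID."""
    return pith_number[:length]


if __name__ == "__main__":
    derived = pith_from_hash(CANONICAL_HASH)
    print(derived, derived == PITH_NUMBER)
    for n in (8, 12, 16):
        print(f"pith_short_{n}:", short_alias(PITH_NUMBER, n))
```

If the relationship holds in general, a mirror could sanity-check that a Pith Number and a canonical hash belong together without contacting the server.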
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/B7LAWZNLDUKNYWFKMMPRTW5SII \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0fd60b65ab1d14dc58aa631f19dbb2421849346328f92888ecdfb939b4b1f9fa
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1c5abb90893fa4fa1e425bfe25f21a3aac15fc4115cdbe716923fe74ca2c39ab",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.IR",
    "submitted_at": "2026-03-30T14:40:58Z",
    "title_canon_sha256": "faed6d372f21fc87b008678a97c37c4731d601fe02c585039e7af4f50ca62bfe"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.04948",
    "kind": "arxiv",
    "version": 2
  }
}
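For an offline check, the canonicalization step of the curl command above can be sketched in plain Python: serialize the record with sorted keys and compact separators, encode as UTF-8, and take the SHA-256 digest. The record is copied from this page; `canonical_bytes` and `canonical_sha256` are hypothetical helper names, and the sketch assumes the JSON displayed here is byte-for-byte what the API returns in `.canonical_record`.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "1c5abb90893fa4fa1e425bfe25f21a3aac15fc4115cdbe716923fe74ca2c39ab",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "primary_cat": "cs.IR",
        "submitted_at": "2026-03-30T14:40:58Z",
        "title_canon_sha256": "faed6d372f21fc87b008678a97c37c4731d601fe02c585039e7af4f50ca62bfe",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.04948", "kind": "arxiv", "version": 2},
}

# Hash the page says the verification command should print.
EXPECTED = "0fd60b65ab1d14dc58aa631f19dbb2421849346328f92888ecdfb939b4b1f9fa"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the one-liner above."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(RECORD)
    print(digest)
    print("matches page:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or transmits the fields.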