pith:MRWBTX6A
DocAtlas: Multilingual Document Understanding Across 80+ Languages
Direct preference optimization with rendering-derived ground truth improves multilingual document understanding across 82 languages without base-language degradation.
arxiv:2605.12623 v1 · 2026-05-12 · cs.CL · cs.CV · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MRWBTX6ANSALCH2ALUNUDRRNUM}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%.
The differential rendering of native DOCX documents and synthetic LaTeX-based generation produce precise structural annotations that accurately represent real-world document distributions across all 82 languages without introducing new biases or distribution shifts.
DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
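For readers unfamiliar with the objective named in the claims, the generic DPO loss can be sketched as below. This is the standard formulation, not DocAtlas's training code. Treating the rendering-derived ground truth as the "chosen" response and a model output as the "rejected" one is an assumption read off the first claim, and the function names, argument names, and `beta=0.1` are illustrative only.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; the record does not state one
) -> float:
    """Standard DPO loss for a single preference pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # Assumed pairing: chosen = rendered ground truth, rejected = model output.
    print(dpo_loss(-12.0, -15.0, -14.0, -14.0))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's. The reference terms anchor the policy to its starting point.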
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-18T03:10:00.376172Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MRWBTX6ANSALCH2ALUNUDRRNUM \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "a95951fe49943af07040052cbe2b31c810d39c9c6e61cca1684157a13fe4133d",
    "cross_cats_sorted": [
      "cs.CV",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-12T18:09:38Z",
    "title_canon_sha256": "fc124a2f6bfc9f9191983bcdf92128c854775df7e9d78a1b75184569c9a8918c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12623",
    "kind": "arxiv",
    "version": 1
  }
}
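The verification command can also be run offline against the record shown on this page. The sketch below mirrors the canonicalization used in the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8, SHA-256). The names `canonical_sha256`, `RECORD`, and `EXPECTED` are illustrative and not part of any Pith API. It assumes the JSON printed above is the complete canonical record.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the verification one-liner does."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "a95951fe49943af07040052cbe2b31c810d39c9c6e61cca1684157a13fe4133d",
        "cross_cats_sorted": ["cs.CV", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-12T18:09:38Z",
        "title_canon_sha256": "fc124a2f6bfc9f9191983bcdf92128c854775df7e9d78a1b75184569c9a8918c",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12623", "kind": "arxiv", "version": 1},
}

# Canonical hash as published on this page.
EXPECTED = "646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538"

if __name__ == "__main__":
    digest = canonical_sha256(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because the keys are sorted before hashing, the indentation and key order of the displayed JSON do not affect the result. Only the keys and values do.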