Pith Number

pith:MRWBTX6A

pith:2026:MRWBTX6ANSALCH2ALUNUDRRNUM
not attested · not anchored · not stored · refs resolved

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Abdullah Sohail, Ahmed Heakl, Ahmed Nassar, Fahad Shahbaz Khan, Imran Razzak, Peter W. J. Staar, Rania Elbadry, Salman Khan, Youssef Mohamed

Direct preference optimization with rendering-derived ground truth improves multilingual document understanding across 82 languages without base-language degradation.

arxiv:2605.12623 v1 · 2026-05-12 · cs.CL · cs.CV · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MRWBTX6ANSALCH2ALUNUDRRNUM}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
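The record does not spell out the merge algorithm, so the following is only a minimal sketch of how an order-independent merge could work. It assumes each event carries a unique `id`, a sortable `ts`, and a single `field`/`value` update; these names, the example events, and the helpers `merge_events` and `current_state` are hypothetical, and signature checking is omitted.

```python
import hashlib
import json


def merge_events(*event_sets):
    """Union several event lists, drop duplicates by id, return one canonical order.

    Assumption: events that share an id are byte-identical (they are signed),
    so it does not matter which copy is kept.
    """
    by_id = {}
    for events in event_sets:
        for ev in events:
            by_id.setdefault(ev["id"], ev)
    # Sorting by (ts, id) makes the result independent of arrival order.
    return sorted(by_id.values(), key=lambda ev: (ev["ts"], ev["id"]))


def current_state(record, events):
    """Fold the ordered events over a copy of the canonical record."""
    state = dict(record)
    for ev in events:
        state[ev["field"]] = ev["value"]
    return state


def state_digest(state):
    """Hash a state so two mirrors can compare results cheaply."""
    blob = json.dumps(state, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


# Illustrative, made-up data: two mirrors hold overlapping event sets.
record = {"source": "arxiv:2605.12623", "citations": "open"}
mirror_a = [
    {"id": "e1", "ts": 1, "field": "citations", "value": "1 recorded"},
    {"id": "e2", "ts": 2, "field": "replications", "value": "open"},
]
mirror_b = [
    {"id": "e2", "ts": 2, "field": "replications", "value": "open"},
    {"id": "e3", "ts": 3, "field": "citations", "value": "2 recorded"},
]

state_ab = current_state(record, merge_events(mirror_a, mirror_b))
state_ba = current_state(record, merge_events(mirror_b, mirror_a))
print(state_digest(state_ab) == state_digest(state_ba))
```

The design point is that deduplication plus a total order over events removes any dependence on which mirror supplied which event first.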

Claims

C1 strongest claim

Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%.
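For readers unfamiliar with DPO, the sketch below shows the standard per-example DPO objective for one preference pair. It is not confirmed to be the authors' exact implementation: it assumes the "chosen" response is the rendering-derived ground truth and the "rejected" response is some other model output, and the function name `dpo_loss` and the default `beta=0.1` are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # How much more the policy favors the chosen output than the reference does,
    # minus the same quantity for the rejected output, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls as the policy shifts probability toward the chosen output.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

Because the loss depends only on log-ratios against the reference model, the policy is penalized for drifting from the base model in ways that do not improve the preference margin.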

C2 weakest assumption

The differential rendering of native DOCX documents and synthetic LaTeX-based generation produce precise structural annotations that accurately represent real-world document distributions across all 82 languages without introducing new biases or distribution shifts.

C3 one line summary

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

References

85 extracted · 85 resolved · 14 Pith anchors


Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-18T03:10:00.376172Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538

Aliases

arxiv: 2605.12623 · arxiv_version: 2605.12623v1 · doi: 10.48550/arxiv.2605.12623 · pith_short_12: MRWBTX6ANSAL · pith_short_16: MRWBTX6ANSALCH2A · pith_short_8: MRWBTX6A
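The aliases on this page are related in a simple way: each `pith_short_N` value is the first N characters of the full Pith Number, and for this record the full number equals the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash bytes. The sketch below checks that relationship. Whether the builder derives every Pith Number this way is an assumption; the helper names are hypothetical.

```python
import base64

CANONICAL_HASH = "646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538"
FULL_NUMBER = "MRWBTX6ANSALCH2ALUNUDRRNUM"


def pith_number_from_hash(hex_digest, length=26):
    """Base32-encode the raw hash bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier on this page relates to its
    hash; it is inferred from this one record, not from a published spec.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_alias(number, n):
    """Return the n-character prefix used as a short alias."""
    return number[:n]


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == FULL_NUMBER)
for n in (8, 12, 16):
    print(f"pith_short_{n}: {short_alias(derived, n)}")
```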
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MRWBTX6ANSALCH2ALUNUDRRNUM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "a95951fe49943af07040052cbe2b31c810d39c9c6e61cca1684157a13fe4133d",
    "cross_cats_sorted": [
      "cs.CV",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-12T18:09:38Z",
    "title_canon_sha256": "fc124a2f6bfc9f9191983bcdf92128c854775df7e9d78a1b75184569c9a8918c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12623",
    "kind": "arxiv",
    "version": 1
  }
}
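The same check can be run offline against a saved copy of the canonical record. The sketch below reuses the serialization settings from the one-liner in the Agent API section (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and compares the result with the published hash. The function name `canonical_sha256` is illustrative, and the comparison result is printed rather than assumed.

```python
import hashlib
import json

EXPECTED = "646c19dfc06c80b11f405d1b41c62da33bf7c529cf62a53523b492dc102f7538"


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(blob).hexdigest()


# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "a95951fe49943af07040052cbe2b31c810d39c9c6e61cca1684157a13fe4133d",
        "cross_cats_sorted": ["cs.CV", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-12T18:09:38Z",
        "title_canon_sha256": "fc124a2f6bfc9f9191983bcdf92128c854775df7e9d78a1b75184569c9a8918c",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12623", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys before hashing means the digest does not depend on the key order in which a mirror happens to store the record.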