
arxiv: 2605.15794 · v1 · pith:LSRWB36Enew · submitted 2026-05-15 · 💻 cs.CL

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

Pith reviewed 2026-05-20 18:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal machine translation · PDF translation · layout preservation · spatial grounding · document reconstruction · dataset benchmark · geometric features

The pith

A dataset of nearly 4,000 multilingual PDFs shows standard translation systems routinely lose the visual layout and spatial links between text and page elements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ForMaT as a parallel collection of 3,956 PDFs covering 15 language pairs while retaining original layout metadata such as positions of text, tables, and formulas. Evaluation on this resource demonstrates that typical machine translation systems break the connection between translated content and its geometric context on the page. The resulting benchmark is intended to support new models that combine visual layout signals with textual translation to produce reconstructed documents that stay close to the source structure. By focusing on visually diverse documents, the work isolates the specific failure mode of spatial desynchronization in current approaches.

Core claim

ForMaT supplies a parallel corpus of PDFs that keeps layout metadata intact across languages. Tests with existing systems on this corpus show repeated loss of spatial grounding, where text no longer aligns with its original visual surroundings after translation. The dataset therefore supplies the concrete test cases needed to develop translation methods that treat layout as an integral part of the output rather than an afterthought.

What carries the argument

The ForMaT parallel corpus, built by K-Medoids sampling across 45 geometric features to retain layout metadata while selecting for structural variety.
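As a rough illustration of the sampling idea, the sketch below runs a plain alternating k-medoids on toy 2-D vectors and returns the medoid indices, which would be the documents kept as cluster representatives. Everything here is an assumption made for illustration: the function name `k_medoids`, the Euclidean metric, the farthest-first initialisation, and the toy data standing in for the 45 geometric features. The source does not describe the authors' actual implementation or their choice of K.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; returns (medoid_indices, labels)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # Pairwise Euclidean distances (an assumed metric, not the paper's).
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Deterministic farthest-first initialisation (an illustrative choice).
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(
            max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        )

    def assign(current):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


# Toy stand-in for per-document feature vectors (hypothetical data).
features = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (12, 10), (20, 0), (21, 0), (22, 0)]
medoids, labels = k_medoids(features, k=3)
print(sorted(medoids))  # [1, 4, 7]
```

Unlike k-means centroids, medoids are always actual data points, which is what makes the method usable for picking real documents out of a pool.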

If this is right

  • Layout-aware models that receive both text and geometric features can produce higher-fidelity reconstructed documents.
  • Translation pipelines will need explicit mechanisms for geometric synchronization to avoid losing visual context.
  • Benchmarks focused on complex elements such as nested tables and formulas will drive measurable progress on document-level translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling strategy could be applied to other document formats to test whether layout failures are PDF-specific or general.
  • New evaluation metrics that score both textual accuracy and positional fidelity would follow directly from using this dataset.
  • Practical tools for translating technical reports or contracts could incorporate the dataset to enforce layout consistency.

Load-bearing premise

That clustering PDFs by 45 geometric features produces a set of documents whose layout challenges are representative and free of selection bias.

What would settle it

Run standard machine translation pipelines on the ForMaT test splits and measure the rate at which translated text blocks and elements fall outside their original bounding-box positions or break table and formula alignments.
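One way such a measurement could be set up is sketched below: pair each source block with its counterpart in the translated output and count the pairs whose intersection-over-union (IoU) drops below a threshold. The helper names `iou` and `displacement_rate`, the `(x0, y0, x1, y1)` box format, and the 0.5 threshold are all assumptions for illustration; the source does not define a metric, and block pairing is taken as given.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def displacement_rate(source_boxes, output_boxes, threshold=0.5):
    """Fraction of paired blocks whose IoU falls below the threshold.

    Assumes the two lists are already aligned block by block;
    the 0.5 threshold is a hypothetical default.
    """
    if len(source_boxes) != len(output_boxes):
        raise ValueError("box lists must be aligned and of equal length")
    if not source_boxes:
        return 0.0
    displaced = sum(
        1 for s, o in zip(source_boxes, output_boxes) if iou(s, o) < threshold
    )
    return displaced / len(source_boxes)


source = [(0, 0, 10, 10), (20, 0, 30, 10)]
output = [(0, 0, 10, 10), (40, 0, 50, 10)]  # second block drifted away
print(displacement_rate(source, output))  # 0.5
```

A rate like this only captures positional drift; broken table or formula alignment would need structure-aware checks on top of it.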

Figures

Figures reproduced from arXiv: 2605.15794 by Adrian Charkiewicz, Dawid Wiśniewski, Kamil Guttmann, Michał Ciesiółka.

Figure 1
ForMaT dataset collection process. Each operation was performed independently for each language pair in both domains. … set resulted in unrepresented language pairs at the sampling stage. 3.2 Data sampling To balance data across the two primary domains and fifteen language pairs, we targeted a sample of 1,000 documents per pair in each domain. We adopted a quota sampling strategy (Cochran, 1977) with two mod…
Figure 2
Spearman correlation matrix of document complexity metrics
Figure 4
Bounding box area distribution on a logarithmic scale. The concentration of "micro-entities" (indicated by the peak at log10 Area ≈ 4.0) highlights the high degree of layout fragmentation. We analyzed the physical scale of the document components. By examining the distribution of bounding box (BBox) areas on a logarithmic scale, we identified a high degree of layout fragmentation. As shown in the BBox are…
Figure 3
Distribution of horizontal layout entropy across documents. Low entropy indicates columnar layouts with predictable vertical alignment of text blocks, while high entropy reflects chaotic layouts with irregular spatial distribution and disrupted reading order. Beyond simple entity counts, we measured the spatial organization of content using horizontal layout entropy (H) seen in …
Figure 5
Distribution of the Overall Fill Factor across the corpus. Finally, we quantified the physical organization of the corpus using fill factor analysis, which measures the ratio of bounding box areas to the total page area. This metric provides a macroscopic view of document saturation, allowing us to categorize the corpus into distinct layout types. As illustrated in the multi-modal distribution of …
Figure 6
Text Area Ratio per page
Figure 8
Table cells translation error presenting different semantic meaning to each cell
Figure 9
Image caption translation error introduced by missing image context. 5.1.2 Structural Errors (a) Source text (b) Translation system result
Figure 10
Dual-column numbered list reconstruction error. The system misplaced newline characters and resized gray background, breaking the parallel alignment
Figure 13
Translation system losing semantic translation context between lines and misplacing the text underline. (a) Source text (b) Original target text (c) Translation system result (d) Translation system result
Figure 14
The ground-truth translation correctly renders "Adoption" as "Przyjęcie" (Legal Adoption/Approval) to match the legislative subject matter. However, the tested systems exhibit a significant domain mismatch, mistranslating the term as "Adopcja" (Biological/Family Adoption). This error stems from a loss of contextual continuity between layout elements
Figure 16
Inline asset collision and anchor failure. The original document features functional icons embedded within the text flow. The translation system fails to account for these inline graphical assets during the reconstruction phase. (a) Source text (b) Translation system result
Figure 15
Geometric synchronization failure and layer detachment. In the original document, the compliance statement is properly encapsulated within a table structure. However, the translation system fails to maintain the link between the bounding box and its textual content. (a) Source text (b) Translation system result
Figure 17
Failure in structural reconstruction and stylistic preservation. In the reconstructed output, the system fails to preserve the typographic weight and the pink-colored indices, rendering all elements in a default black font. Furthermore, the translation exhibits a significant vertical alignment drift
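Two of the layout statistics named in the captions for Figures 3 and 5 can be sketched in a few lines. The fill factor follows the caption's wording directly (summed bounding-box area over page area). The entropy function is only one plausible reading: the source does not give the formula for H, so the version below assumes Shannon entropy over block left edges binned across the page width. The function names, the `(x0, y0, x1, y1)` box format, and the bin count of 10 are all hypothetical.

```python
import math
from collections import Counter


def fill_factor(boxes, page_width, page_height):
    """Summed bounding-box area divided by page area.

    Overlapping boxes are counted twice in this simple version.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("page dimensions must be positive")
    return sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes) / page_area


def horizontal_layout_entropy(boxes, page_width, bins=10):
    """Shannon entropy (bits) of binned left edges; an assumed definition of H."""
    if not boxes:
        return 0.0
    counts = Counter(
        max(0, min(bins - 1, int(x0 / page_width * bins))) for x0, _, _, _ in boxes
    )
    total = len(boxes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical two-column page, 600 units wide: left edges at 50 and 300.
two_column = [(50, 0, 280, 100), (50, 120, 280, 220), (300, 0, 530, 100), (300, 120, 530, 220)]
print(horizontal_layout_entropy(two_column, page_width=600))  # 1.0
```

Under this reading, a strictly columnar page concentrates left edges in a few bins and scores low, while scattered blocks spread across many bins and score high, which matches the qualitative description in the Figure 3 caption.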
Original abstract

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ForMaT, a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata. The authors use K-Medoids sampling over 45 geometric features to select structurally diverse, visually complex documents containing elements such as nested tables and formulas. Evaluation indicates that current MT systems struggle with spatial grounding and geometric synchronization, losing links between text and visual context; the dataset is positioned as a benchmark for layout-aware multimodal translation models.

Significance. If the sampling procedure demonstrably covers a broad range of layout families and the reported MT failures are shown to stem specifically from missing visual context rather than other factors, ForMaT could become a useful resource for research on document-level multimodal MT and layout-preserving translation.

major comments (2)
  1. [§3 (Dataset Construction)] The K-Medoids clustering on 45 geometric features is presented without any reported validation of cluster coverage, silhouette scores, or explicit checks that key layout families (multi-column articles, dense infographics, form-like documents) are represented in the final 3,956-document set. This directly affects the central claim that the corpus supplies a structurally diverse benchmark for testing geometric synchronization.
  2. [§4 (Evaluation)] The claim that MT systems 'struggle with spatial grounding' is stated without accompanying quantitative metrics, baseline comparisons, or error analysis that isolates layout-related failures from other translation errors; this weakens the diagnostic value of the benchmark.
minor comments (1)
  1. [Abstract] The number of documents per language pair and the exact set of 15 languages are not stated; reporting both would help readers assess balance.
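For readers unfamiliar with the silhouette check requested in the first major comment, a minimal pure-Python sketch follows. It computes the mean silhouette coefficient for a labelled set of feature vectors; values near 1 suggest compact, well-separated clusters, and values near or below 0 suggest overlap. The function name, the Euclidean metric, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical 2-D stand-ins for document feature vectors.
vectors = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(vectors, [0, 0, 1, 1])  # clusters match the geometry
bad = silhouette_score(vectors, [0, 1, 0, 1])   # clusters cut across it
print(good > bad)  # True
```

Sweeping this score over candidate values of K is a common way to justify the number of clusters, though it says nothing by itself about whether specific layout families are covered.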

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment below and indicate the planned revisions.

point-by-point responses
  1. Referee: [§3 (Dataset Construction)] The K-Medoids clustering on 45 geometric features is presented without any reported validation of cluster coverage, silhouette scores, or explicit checks that key layout families (multi-column articles, dense infographics, form-like documents) are represented in the final 3,956-document set. This directly affects the central claim that the corpus supplies a structurally diverse benchmark for testing geometric synchronization.

    Authors: We agree that the original submission did not report validation metrics for the K-Medoids procedure. In the revised manuscript we will add to §3 a silhouette score analysis for the selected number of clusters together with a quantitative breakdown of layout family coverage (multi-column articles, dense infographics, form-like documents) derived from the 45 geometric features. This will be accompanied by illustrative examples and feature-distribution plots to substantiate the claim of structural diversity. revision: yes

  2. Referee: [§4 (Evaluation)] The claim that MT systems 'struggle with spatial grounding' is stated without accompanying quantitative metrics, baseline comparisons, or error analysis that isolates layout-related failures from other translation errors; this weakens the diagnostic value of the benchmark.

    Authors: We acknowledge that the evaluation section would benefit from additional quantitative support. The revised §4 will incorporate layout-specific metrics (e.g., bounding-box overlap and geometric alignment error), direct comparisons against text-only MT baselines, and a categorized error analysis distinguishing layout-related failures from content-related ones. These additions will strengthen the diagnostic value of ForMaT for layout-aware translation research. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset sampling and benchmark claim are independent of inputs

full rationale

The paper constructs ForMaT by applying K-Medoids clustering to 45 geometric features extracted from PDFs, then evaluates existing MT systems on the resulting corpus. This selection procedure is a one-way preprocessing step whose output (the 3,956-document set) is not fed back into any derivation or prediction that would make the diversity claim tautological. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the abstract or described construction; the claim that current MT systems lose spatial grounding is an empirical observation on the held-out data rather than a restatement of the sampling method itself. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of K-Medoids clustering for visual diversity and the representativeness of the selected PDFs; no new physical entities are introduced, and assumptions are standard in dataset curation rather than ad-hoc inventions.

free parameters (2)
  • number of geometric features
    45 features chosen by hand to capture complex elements like nested tables and formulas.
  • K in K-Medoids
    Number of clusters for sampling not specified in abstract but required for the diversity selection process.
axioms (1)
  • domain assumption K-Medoids sampling over geometric features captures structural diversity in PDFs
    Invoked to justify selection of visually diverse documents for the benchmark.

pith-pipeline@v0.9.0 · 5647 in / 1272 out tokens · 41368 ms · 2026-05-20T18:44:58.180429+00:00 · methodology



Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    [Cochran1977] Cochran, William G

    Gemini: A family of highly capable multimodal models. [Cochran1977] Cochran, William G. 1977. Sampling Techniques. John Wiley & Sons, New York, NY, 3rd edition. [Cui et al.2025] Cui, Cheng, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zh...

  2. [2]

    PaddleOCR 3.0 technical report. July. [Elliott et al.2016] Elliott, Desmond, Stella Frank, Khalil Sima’an, and Lucia Specia

  3. [3]

    Multi30k: Multilingual english-german image descriptions. In Proceedings of the 5th Workshop on Vision and Language, hosted by the 54th Annual Meeting of the Association for Computational Linguistics, VL@ACL 2016, August 12, Berlin, Germany. The Association for Computer Linguistics. [Feng et al.2025] Feng, Yi, Chuanyi Li, Jiatong He, Zhenyu Hou, and V...

  4. [4]

    Multimodal neural machine translation: A survey of the state of the art. In Christodoulopoulos, Christos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 22130–22147. Association for Computational Lingu...

  5. [5]

    TranslateGemma technical report. arXiv:2601.09012, 2026

    Translategemma technical report. arXiv preprint arXiv:2601.09012. [Futeral et al.2025] Futeral, Matthieu, Cordelia Schmid, Benoît Sagot, and Rachel Bawden

  6. [6]

    In Chiruzzo, Luis, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 761–778, Albuquerque, New Mexico, April

    Towards zero-shot multimodal machine translation. In Chiruzzo, Luis, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 761–778, Albuquerque, New Mexico, April. Association for Computational Linguistics. [Hsu et al.2024] Hsu, Benjamin, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nad...

  7. [7]

    M3T: A new benchmark dataset for multi-modal document-level machine translation. In Duh, Kevin, Helena Gómez-Adorno, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Short Papers, NAACL 2024, Mexico City, Mexico, June 16- 2...

  8. [8]

    [Huang et al.2026] Huang, Ailin, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, et al

    Layoutlmv3: Pre-training for document ai with unified text and image masking. [Huang et al.2026] Huang, Ailin, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, et al

  9. [9]

    Step3-vl-10b technical report. arXiv preprint arXiv:2601.09668, 2026

    Step3-vl-10b technical report. arXiv preprint arXiv:2601.09668. [Kaufman and Rousseeuw1990] Kaufman, Leonard and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction To Cluster Analysis

  10. [10]

    [Liu et al.2023] Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee

  11. [11]

    In Oh, A., T

    Visual instruction tuning. In Oh, A., T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc. [Lu et al.2025] Lu, Jinghui, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, ...

  12. [12]

    A bounding box is worth one token - interleaving layout and text in a large language model for document understanding. In Che, Wanxiang, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 7252–7273, Vienna, Austria, July. Association for Computational Li...

  13. [13]

    [O’Brien et al.2025] O’Brien, Dayyán, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, and Jörg Tiedemann

    Global-local dual perception for mllms in high-resolution text-rich image translation. [O’Brien et al.2025] O’Brien, Dayyán, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, and Jörg Tiedemann

  14. [14]

    [Shen et al.2024] Shen, Huangjun, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, and Jinsong Su

    Dochplt: A massively multilingual document-level translation dataset. CoRR, abs/2508.13079. [Shen et al.2024] Shen, Huangjun, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, and Jinsong Su

  15. [15]

    [Sun et al.2025a] Sun, Ting, Cheng Cui, Yuning Du, and Yi Liu

    A survey on multi-modal machine translation: Tasks, methods and challenges. [Sun et al.2025a] Sun, Ting, Cheng Cui, Yuning Du, and Yi Liu. 2025a. PP-DocLayout: A unified document layout detection model to accelerate large-scale data construction. March. [Sun et al.2025b] Sun, Yirong, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, and Xiaoyu Shen. 202...

  16. [16]

    TaBERT: Pretraining for joint understanding of textual and tabular data. In Jurafsky, Dan, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online, July. Association for Computational Linguistics. [Yin et al.2024] Yin, Shukang, Chaoyou Fu...

  17. [17]

    [Zhang et al.2025a] Zhang, Le, Qian Yang, and Aishwarya Agrawal. 2025a. Assessing and learning alignment of unimodal vision and language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 14604–14614. Computer Vision Foundation / IEEE. [Zhang et al.2025b] Zhang, Yaping, Yu...

  18. [18]

    [Zuo et al.2025] Zuo, Fei, Kehai Chen, Yu Zhang, Zhengshan Xue, and Min Zhang

    Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception. [Zuo et al.2025] Zuo, Fei, Kehai Chen, Yu Zhang, Zhengshan Xue, and Min Zhang

  19. [19]

    Tylko tyłem

    InImageTrans: Multimodal LLM-based text image machine translation. In Che, Wanxiang, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 20256–20277, Vienna, Austria, July. Association for Computational Linguistics. Appendix A. Vector Indices Mapping Tabl...