ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation
Pith reviewed 2026-05-20 18:44 UTC · model grok-4.3
The pith
A dataset of nearly 4,000 multilingual PDFs shows standard translation systems routinely lose the visual layout and spatial links between text and page elements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ForMaT supplies a parallel corpus of PDFs that keeps layout metadata intact across languages. Tests with existing systems on this corpus show repeated loss of spatial grounding, where text no longer aligns with its original visual surroundings after translation. The dataset therefore supplies the concrete test cases needed to develop translation methods that treat layout as an integral part of the output rather than an afterthought.
What carries the argument
The ForMaT parallel corpus, built by K-Medoids sampling across 45 geometric features to retain layout metadata while selecting for structural variety.
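The selection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses a simple alternating k-medoids variant with Euclidean distance, and the function name `k_medoids`, the synthetic feature pool, and the choice of `k=10` are all assumptions made for the example.

```python
import numpy as np

def k_medoids(features, k, n_iter=50, seed=0):
    """Alternating k-medoids with Euclidean distance (illustrative sketch).

    features: (n_docs, n_features) array, e.g. 45 geometric features per PDF.
    Returns (medoid_indices, labels).
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Pairwise Euclidean distances; fine for small pools, O(n^2) memory.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every document to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Hypothetical pool: 200 documents, 45 synthetic features each.
pool = np.random.default_rng(1).normal(size=(200, 45))
medoids, labels = k_medoids(pool, k=10)
selected_documents = sorted(medoids.tolist())  # indices of the sampled PDFs
```

Because medoids are actual documents rather than averaged centroids, the selected indices can be used directly as the sampled subset.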
If this is right
- Layout-aware models that receive both text and geometric features can produce higher-fidelity reconstructed documents.
- Translation pipelines will need explicit mechanisms for geometric synchronization to avoid losing visual context.
- Benchmarks focused on complex elements such as nested tables and formulas will drive measurable progress on document-level translation.
Where Pith is reading between the lines
- The same sampling strategy could be applied to other document formats to test whether layout failures are PDF-specific or general.
- New evaluation metrics that score both textual accuracy and positional fidelity would follow directly from using this dataset.
- Practical tools for translating technical reports or contracts could incorporate the dataset to enforce layout consistency.
Load-bearing premise
That clustering PDFs by 45 geometric features produces a set of documents whose layout challenges are representative and free of selection bias.
What would settle it
Run standard machine translation pipelines on the ForMaT test splits and measure the rate at which translated text blocks and elements fall outside their original bounding-box positions or break table and formula alignments.
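One way such a measurement could look is sketched below. It assumes each text block has a one-to-one match between source and translated page, that boxes are given as (x0, y0, x1, y1) tuples, and that a block counts as displaced when its intersection-over-union with the original box drops below 0.5. The names `iou` and `drift_rate` and the threshold are illustrative assumptions, not metrics taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def drift_rate(source_boxes, translated_boxes, threshold=0.5):
    """Fraction of blocks whose translated box has IoU below `threshold`
    with the corresponding source box."""
    if len(source_boxes) != len(translated_boxes):
        raise ValueError("boxes must be aligned one-to-one")
    if not source_boxes:
        return 0.0
    drifted = sum(
        iou(s, t) < threshold for s, t in zip(source_boxes, translated_boxes)
    )
    return drifted / len(source_boxes)

# Two blocks: the first stays in place, the second shifts 8 units right.
source = [(0, 0, 10, 10), (0, 20, 10, 30)]
translated = [(0, 0, 10, 10), (8, 20, 18, 30)]
rate = drift_rate(source, translated)  # 0.5: one of two blocks drifted
```

A real evaluation would also need a block-matching step and separate checks for table and formula alignment, which this sketch leaves out.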
Original abstract
We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ForMaT, a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata. The authors use K-Medoids sampling over 45 geometric features to select structurally diverse, visually complex documents containing elements such as nested tables and formulas. Evaluation indicates that current MT systems struggle with spatial grounding and geometric synchronization, losing links between text and visual context; the dataset is positioned as a benchmark for layout-aware multimodal translation models.
Significance. If the sampling procedure demonstrably covers a broad range of layout families and the reported MT failures are shown to stem specifically from missing visual context rather than other factors, ForMaT could become a useful resource for research on document-level multimodal MT and layout-preserving translation.
major comments (2)
- §3 (Dataset Construction): The K-Medoids clustering on 45 geometric features is presented without any reported validation of cluster coverage, silhouette scores, or explicit checks that key layout families (multi-column articles, dense infographics, form-like documents) are represented in the final 3,956-document set. This directly affects the central claim that the corpus supplies a structurally diverse benchmark for testing geometric synchronization.
- §4 (Evaluation): The claim that MT systems 'struggle with spatial grounding' is stated without accompanying quantitative metrics, baseline comparisons, or error analysis that isolates layout-related failures from other translation errors; this weakens the diagnostic value of the benchmark.
minor comments (1)
- Abstract: The number of documents per language pair and the exact set of 15 languages are not stated, which would help readers assess balance.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript. We address each major comment below and indicate the planned revisions.
- Referee: §3 (Dataset Construction): The K-Medoids clustering on 45 geometric features is presented without any reported validation of cluster coverage, silhouette scores, or explicit checks that key layout families (multi-column articles, dense infographics, form-like documents) are represented in the final 3,956-document set. This directly affects the central claim that the corpus supplies a structurally diverse benchmark for testing geometric synchronization.
  Authors: We agree that the original submission did not report validation metrics for the K-Medoids procedure. In the revised manuscript we will add to §3 a silhouette score analysis for the selected number of clusters together with a quantitative breakdown of layout family coverage (multi-column articles, dense infographics, form-like documents) derived from the 45 geometric features. This will be accompanied by illustrative examples and feature-distribution plots to substantiate the claim of structural diversity.
  revision: yes
- Referee: §4 (Evaluation): The claim that MT systems 'struggle with spatial grounding' is stated without accompanying quantitative metrics, baseline comparisons, or error analysis that isolates layout-related failures from other translation errors; this weakens the diagnostic value of the benchmark.
  Authors: We acknowledge that the evaluation section would benefit from additional quantitative support. The revised §4 will incorporate layout-specific metrics (e.g., bounding-box overlap and geometric alignment error), direct comparisons against text-only MT baselines, and a categorized error analysis distinguishing layout-related failures from content-related ones. These additions will strengthen the diagnostic value of ForMaT for layout-aware translation research.
  revision: yes
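The silhouette analysis promised for §3 could be computed along the following lines. This is a sketch under stated assumptions: Euclidean distance, the standard silhouette definition with singleton clusters scored as 0, and a hypothetical helper named `mean_silhouette`. The toy points stand in for the 45-dimensional feature vectors.

```python
import numpy as np

def mean_silhouette(features, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("silhouette needs at least two clusters")
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the point).
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Two tight, well-separated groups of toy points.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(points, [0, 0, 1, 1])  # close to 1
bad = mean_silhouette(points, [0, 1, 0, 1])   # negative: clusters are mixed
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest the chosen K does not match real structure in the feature space.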
Circularity Check
No circularity: dataset sampling and benchmark claim are independent of inputs
The paper constructs ForMaT by applying K-Medoids clustering to 45 geometric features extracted from PDFs, then evaluates existing MT systems on the resulting corpus. This selection procedure is a one-way preprocessing step whose output (the 3,956-document set) is not fed back into any derivation or prediction that would make the diversity claim tautological. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the abstract or described construction; the claim that current MT systems lose spatial grounding is an empirical observation on the held-out data rather than a restatement of the sampling method itself. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of geometric features
- K in K-Medoids
axioms (1)
- domain assumption: K-Medoids sampling over geometric features captures structural diversity in PDFs
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "To capture maximum structural and stylistic diversity, we employed the K-Medoids clustering algorithm to select representative documents from our vectorized pool... clustering the documents into K groups and selecting the medoid... K=100 distinct groups per language pair"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We performed clustering independently for each language pair within the two domains... using the Euclidean distance metric"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.