ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

Adrian Charkiewicz; Dawid Wiśniewski; Kamil Guttmann; Michał Ciesiółka

arxiv: 2605.15794 · v1 · pith:LSRWB36Enew · submitted 2026-05-15 · 💻 cs.CL


Pith reviewed 2026-05-20 18:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal machine translation · PDF translation · layout preservation · spatial grounding · document reconstruction · dataset benchmark · geometric features

The pith

A dataset of nearly 4,000 multilingual PDFs shows standard translation systems routinely lose the visual layout and spatial links between text and page elements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ForMaT as a parallel collection of 3,956 PDFs covering 15 language pairs while retaining original layout metadata such as positions of text, tables, and formulas. Evaluation on this resource demonstrates that typical machine translation systems break the connection between translated content and its geometric context on the page. The resulting benchmark is intended to support new models that combine visual layout signals with textual translation to produce reconstructed documents that stay close to the source structure. By focusing on visually diverse documents, the work isolates the specific failure mode of spatial desynchronization in current approaches.

Core claim

ForMaT supplies a parallel corpus of PDFs that keeps layout metadata intact across languages. Tests with existing systems on this corpus show repeated loss of spatial grounding, where text no longer aligns with its original visual surroundings after translation. The dataset therefore supplies the concrete test cases needed to develop translation methods that treat layout as an integral part of the output rather than an afterthought.

What carries the argument

The ForMaT parallel corpus, built by K-Medoids sampling across 45 geometric features to retain layout metadata while selecting for structural variety.
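
The selection step named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `k_medoids`, the alternating update rule, the random initialization, and the synthetic 45-column feature matrix are all assumptions, and the page does not say which K-Medoids variant or feature scaling the paper uses. Only the Euclidean metric and the idea of keeping each cluster's medoid come from the quoted paper passages.

```python
import numpy as np


def k_medoids(features, k, n_iter=100, seed=0):
    """Alternating K-Medoids on Euclidean distances.

    Returns (medoid_indices, labels). Hypothetical helper for illustration.
    """
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    n = features.shape[0]

    # Pairwise Euclidean distance matrix of shape (n, n).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Start from k distinct documents chosen at random.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each document joins its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the previous medoid for an empty cluster
            # Update step: the medoid is the member with the smallest
            # total distance to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids

    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels


# Synthetic stand-in for a pool of 300 documents with 45 geometric features.
pool = np.random.default_rng(42).normal(size=(300, 45))
selected, assignment = k_medoids(pool, k=10)
print(sorted(selected.tolist()))  # indices of the documents that would be kept
```

Because a medoid is always an actual document, the selected indices map directly back to PDFs in the pool, which is what makes the method usable as a sampling step rather than only as a clustering step.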

If this is right

Layout-aware models that receive both text and geometric features can produce higher-fidelity reconstructed documents.
Translation pipelines will need explicit mechanisms for geometric synchronization to avoid losing visual context.
Benchmarks focused on complex elements such as nested tables and formulas will drive measurable progress on document-level translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sampling strategy could be applied to other document formats to test whether layout failures are PDF-specific or general.
New evaluation metrics that score both textual accuracy and positional fidelity would follow directly from using this dataset.
Practical tools for translating technical reports or contracts could incorporate the dataset to enforce layout consistency.

Load-bearing premise

That clustering PDFs by 45 geometric features produces a set of documents whose layout challenges are representative and free of selection bias.

What would settle it

Run standard machine translation pipelines on the ForMaT test splits and measure the rate at which translated text blocks and elements fall outside their original bounding-box positions or break table and formula alignments.
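
One way to make that measurement concrete is an intersection-over-union (IoU) check between each source block and its reconstructed counterpart. The sketch below is an assumption-laden illustration: the `(x0, y0, x1, y1)` box format, the one-to-one pairing of source and output blocks, the 0.5 threshold, and the names `iou` and `drift_rate` are all hypothetical and do not come from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drift_rate(source_boxes, output_boxes, threshold=0.5):
    """Fraction of paired blocks whose IoU with the source box is below threshold."""
    if len(source_boxes) != len(output_boxes):
        raise ValueError("blocks must be paired one-to-one")
    if not source_boxes:
        return 0.0
    misses = sum(iou(s, o) < threshold for s, o in zip(source_boxes, output_boxes))
    return misses / len(source_boxes)


# Two source blocks; the second one has drifted far from its original position.
source = [(0, 0, 10, 10), (0, 20, 10, 30)]
output = [(0, 0, 10, 10), (40, 60, 50, 70)]
print(drift_rate(source, output))
```

A real evaluation would also need a rule for pairing blocks when the translation system splits or merges them, which this sketch leaves out.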

Figures

Figures reproduced from arXiv: 2605.15794 by Adrian Charkiewicz, Dawid Wiśniewski, Kamil Guttmann, Michał Ciesiółka.

**Figure 1.** ForMaT dataset collection process. Each operation was performed independently for each language pair in both domains. Adjacent paper text: … set resulted in unrepresented language pairs at the sampling stage. §3.2 Data sampling: To balance data across the two primary domains and fifteen language pairs, we targeted a sample of 1,000 documents per pair in each domain. We adopted a quota sampling strategy (Cochran, 1977) with two mod…

**Figure 2.** Spearman correlation matrix of document complexity metrics.

**Figure 4.** Bounding box area distribution on a logarithmic scale. The concentration of "micro-entities" (indicated by the peak at log10 Area ≈ 4.0) highlights the high degree of layout fragmentation. Adjacent paper text: We analyzed the physical scale of the document components. By examining the distribution of bounding box (BBox) areas on a logarithmic scale, we identified a high degree of layout fragmentation.

**Figure 3.** Distribution of horizontal layout entropy across documents. Low entropy indicates columnar layouts with predictable vertical alignment of text blocks, while high entropy reflects chaotic layouts with irregular spatial distribution and disrupted reading order. Adjacent paper text: Beyond simple entity counts, we measured the spatial organization of content using horizontal layout entropy (H).

**Figure 5.** Distribution of the Overall Fill Factor across the corpus. Adjacent paper text: Finally, we quantified the physical organization of the corpus using fill factor analysis, which measures the ratio of bounding box areas to the total page area. This metric provides a macroscopic view of document saturation, allowing us to categorize the corpus into distinct layout types.

**Figure 6.** Text Area Ratio per page.

**Figure 8.** Table cells translation error presenting different semantic meaning to each cell.

**Figure 9.** Image caption translation error introduced by missing image context. (a) Source text (b) Translation system result.

**Figure 10.** Dual-column numbered list reconstruction error. The system misplaced newline characters and resized the gray background, breaking the parallel alignment.

**Figure 13.** Translation system losing semantic translation context between lines and misplacing the text underline. (a) Source text (b) Original target text (c) Translation system result (d) Translation system result.

**Figure 14.** The ground-truth translation correctly renders "Adoption" as "Przyjęcie" (Legal Adoption/Approval) to match the legislative subject matter. However, the tested systems exhibit a significant domain mismatch, mistranslating the term as "Adopcja" (Biological/Family Adoption). This error stems from a loss of contextual continuity between layout elements.

**Figure 16.** Inline asset collision and anchor failure. The original document features functional icons embedded within the text flow. The translation system fails to account for these inline graphical assets during the reconstruction phase. (a) Source text (b) Translation system result.

**Figure 15.** Geometric synchronization failure and layer detachment. In the original document, the compliance statement is properly encapsulated within a table structure. However, the translation system fails to maintain the link between the bounding box and its textual content. (a) Source text (b) Translation system result.

**Figure 17.** Failure in structural reconstruction and stylistic preservation. In the reconstructed output, the system fails to preserve the typographic weight and the pink-colored indices, rendering all elements in a default black font. Furthermore, the translation exhibits a significant vertical alignment drift.
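
Two of the layout statistics named in the captions for Figures 3 and 5 can be illustrated with a short sketch. The fill factor follows the description given with Figure 5 (bounding box area over page area). The entropy function is only a guess at the idea behind Figure 3: this page does not give the paper's formula for H, so the histogram over left edges, the bin count, and the names `fill_factor` and `horizontal_entropy` are assumptions made for illustration.

```python
import math


def fill_factor(boxes, page_width, page_height):
    """Summed box area divided by page area; overlapping boxes count twice."""
    total = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return total / (page_width * page_height)


def horizontal_entropy(boxes, page_width, n_bins=10):
    """Shannon entropy in bits of the histogram of box left edges.

    Assumed stand-in for the paper's H: few occupied bins (clean columns)
    give low entropy, many occupied bins (scattered blocks) give high entropy.
    """
    counts = [0] * n_bins
    for x0, _, _, _ in boxes:
        idx = int(x0 / page_width * n_bins)
        counts[max(0, min(idx, n_bins - 1))] += 1  # clamp to a valid bin
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


# A two-column page: four blocks, two starting at x=0 and two at x=50.
page = [(0, 0, 40, 20), (0, 30, 40, 50), (50, 0, 90, 20), (50, 30, 90, 50)]
print(fill_factor(page, page_width=100, page_height=100))
print(horizontal_entropy(page, page_width=100))
```

Under this reading, a strict two-column page lands at exactly one bit, and the value rises as block positions spread across more bins.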

Original abstract

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ForMaT ships a new layout-preserving PDF dataset selected by K-Medoids on geometric features, which could serve as a benchmark for multimodal MT, though the evidence that current systems specifically fail on spatial grounding stays thin.

The paper's main contribution is a parallel corpus of 3,956 PDFs spanning 15 language pairs that keeps layout metadata intact. They select the documents with K-Medoids clustering over 45 geometric features to target visually complex material like nested tables and formulas. That sampling step is the clearest piece of new work; it gives a reproducible way to pull structurally varied PDFs without hand-picking everything.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ForMaT, a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata. The authors use K-Medoids sampling over 45 geometric features to select structurally diverse, visually complex documents containing elements such as nested tables and formulas. Evaluation indicates that current MT systems struggle with spatial grounding and geometric synchronization, losing links between text and visual context; the dataset is positioned as a benchmark for layout-aware multimodal translation models.

Significance. If the sampling procedure demonstrably covers a broad range of layout families and the reported MT failures are shown to stem specifically from missing visual context rather than other factors, ForMaT could become a useful resource for research on document-level multimodal MT and layout-preserving translation.

major comments (2)

[§3 (Dataset Construction)] The K-Medoids clustering on 45 geometric features is presented without any reported validation of cluster coverage, silhouette scores, or explicit checks that key layout families (multi-column articles, dense infographics, form-like documents) are represented in the final 3,956-document set. This directly affects the central claim that the corpus supplies a structurally diverse benchmark for testing geometric synchronization.
[§4 (Evaluation)] The claim that MT systems 'struggle with spatial grounding' is stated without accompanying quantitative metrics, baseline comparisons, or error analysis that isolates layout-related failures from other translation errors; this weakens the diagnostic value of the benchmark.

minor comments (1)

[Abstract] The number of documents per language pair and the exact set of 15 languages are not stated; including them would help readers assess balance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment below and indicate the planned revisions.

Referee: [§3 (Dataset Construction)] The K-Medoids clustering on 45 geometric features is presented without any reported validation of cluster coverage, silhouette scores, or explicit checks that key layout families (multi-column articles, dense infographics, form-like documents) are represented in the final 3,956-document set. This directly affects the central claim that the corpus supplies a structurally diverse benchmark for testing geometric synchronization.

Authors: We agree that the original submission did not report validation metrics for the K-Medoids procedure. In the revised manuscript we will add to §3 a silhouette score analysis for the selected number of clusters together with a quantitative breakdown of layout family coverage (multi-column articles, dense infographics, form-like documents) derived from the 45 geometric features. This will be accompanied by illustrative examples and feature-distribution plots to substantiate the claim of structural diversity. Revision: yes.

Referee: [§4 (Evaluation)] The claim that MT systems 'struggle with spatial grounding' is stated without accompanying quantitative metrics, baseline comparisons, or error analysis that isolates layout-related failures from other translation errors; this weakens the diagnostic value of the benchmark.

Authors: We acknowledge that the evaluation section would benefit from additional quantitative support. The revised §4 will incorporate layout-specific metrics (e.g., bounding-box overlap and geometric alignment error), direct comparisons against text-only MT baselines, and a categorized error analysis distinguishing layout-related failures from content-related ones. These additions will strengthen the diagnostic value of ForMaT for layout-aware translation research. Revision: yes.
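
The silhouette analysis discussed in the first exchange can be sketched as follows. This is a plain NumPy illustration of the standard mean silhouette coefficient on Euclidean distances, not the analysis the authors would actually run; the function name `mean_silhouette` and the toy data are assumptions.

```python
import numpy as np


def mean_silhouette(features, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("silhouette needs at least two clusters")

    # Pairwise Euclidean distance matrix.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = int(own.sum())
        if n_own == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (dist[i, i] is 0, so dividing by n_own - 1 excludes the point itself).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


# Two tight, well-separated groups: a sensible labeling scores close to 1.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(mean_silhouette(points, [0, 0, 1, 1]))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.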

Circularity Check

0 steps flagged

No circularity: dataset sampling and benchmark claim are independent of inputs

The paper constructs ForMaT by applying K-Medoids clustering to 45 geometric features extracted from PDFs, then evaluates existing MT systems on the resulting corpus. This selection procedure is a one-way preprocessing step whose output (the 3,956-document set) is not fed back into any derivation or prediction that would make the diversity claim tautological. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the abstract or described construction; the claim that current MT systems lose spatial grounding is an empirical observation on the held-out data rather than a restatement of the sampling method itself. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of K-Medoids clustering for visual diversity and the representativeness of the selected PDFs; no new physical entities are introduced, and assumptions are standard in dataset curation rather than ad-hoc inventions.

free parameters (2)

number of geometric features
45 features chosen by hand to capture complex elements like nested tables and formulas.
K in K-Medoids
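Number of clusters used for sampling. It is not specified in the abstract but is required for the diversity selection process; the paper passage quoted under "Lean theorems connected to this paper" gives K=100 groups per language pair.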

axioms (1)

domain assumption: K-Medoids sampling over geometric features captures structural diversity in PDFs
Invoked to justify selection of visually diverse documents for the benchmark.

pith-pipeline@v0.9.0 · 5647 in / 1272 out tokens · 41368 ms · 2026-05-20T18:44:58.180429+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: To capture maximum structural and stylistic diversity, we employed the K-Medoids clustering algorithm to select representative documents from our vectorized pool... clustering the documents into K groups and selecting the medoid... K=100 distinct groups per language pair

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We performed clustering independently for each language pair within the two domains... using the Euclidean distance metric

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

[Cochran1977] Cochran, William G

Gemini: A family of highly capable multimodal models. [Cochran1977] Cochran, William G. 1977.Sampling Techniques. John Wiley & Sons, New York, NY , 3rd edition. [Cui et al.2025] Cui, Cheng, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zh...

work page 1977
[2]

PaddleOCR 3.0 technical report. July. [Elliott et al.2016] Elliott, Desmond, Stella Frank, Khalil Sima’an, and Lucia Specia

work page 2016
[3]

Multi30k: Multilingual english-german image descriptions. In Proceedings of the 5th Workshop on Vision and Lan- guage, hosted by the 54th Annual Meeting of the As- sociation for Computational Linguistics, VL@ACL 2016, August 12, Berlin, Germany. The Association for Computer Linguistics. [Feng et al.2025] Feng, Yi, Chuanyi Li, Jiatong He, Zhenyu Hou, and V...

work page 2016
[4]

Multimodal neural machine translation: A survey of the state of the art. In Christodoulopoulos, Christos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 22130–22147. Association for Computational Lin- gu...

work page 2025
[5]

TranslateGemma technical report.arXiv:2601.09012, 2026

Translategemma technical report.arXiv preprint arXiv:2601.09012. [Futeral et al.2025] Futeral, Matthieu, Cordelia Schmid, Benoît Sagot, and Rachel Bawden

work page arXiv 2025
[6]

In Chiruzzo, Luis, Alan Ritter, and Lu Wang, edi- tors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 761–778, Albu- querque, New Mexico, April

To- wards zero-shot multimodal machine translation. In Chiruzzo, Luis, Alan Ritter, and Lu Wang, edi- tors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 761–778, Albu- querque, New Mexico, April. Association for Com- putational Linguistics. [Hsu et al.2024] Hsu, Benjamin, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nad...

work page 2025
[7]

M3T: A new bench- mark dataset for multi-modal document-level ma- chine translation. In Duh, Kevin, Helena Gómez- Adorno, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chap- ter of the Association for Computational Linguis- tics: Human Language Technologies: Short Pa- pers, NAACL 2024, Mexico City, Mexico, June 16- 2...

work page 2024
[8]

[Huang et al.2026] Huang, Ailin, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, et al

Layoutlmv3: Pre-training for document ai with unified text and image masking. [Huang et al.2026] Huang, Ailin, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, et al

work page 2026
[9]

Step3-vl-10b technical report.arXiv preprint arXiv:2601.09668, 2026

Step3-vl-10b technical report.arXiv preprint arXiv:2601.09668. [Kaufman and Rousseeuw1990] Kaufman, Leonard and Peter Rousseeuw. 1990.Finding Groups in Data: An Introduction To Cluster Analysis

work page arXiv 1990
[10]

[Liu et al.2023] Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee

work page 2023
[11]

In Oh, A., T

Visual instruction tuning. In Oh, A., T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc. [Lu et al.2025] Lu, Jinghui, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, ...

work page 2025
[12]

A bounding box is worth one to- ken - interleaving layout and text in a large language model for document understanding. In Che, Wanx- iang, Joyce Nabende, Ekaterina Shutova, and Mo- hammad Taher Pilehvar, editors,Findings of the As- sociation for Computational Linguistics: ACL 2025, pages 7252–7273, Vienna, Austria, July. Association for Computational Li...

work page 2025
[13]

[O’Brien et al.2025] O’Brien, Dayyán, Bhavitvya Ma- lik, Ona de Gibert, Pinzhen Chen, Barry Had- dow, and Jörg Tiedemann

Global-local dual perception for mllms in high-resolution text-rich image transla- tion. [O’Brien et al.2025] O’Brien, Dayyán, Bhavitvya Ma- lik, Ona de Gibert, Pinzhen Chen, Barry Had- dow, and Jörg Tiedemann

work page 2025
[14]

[Shen et al.2024] Shen, Huangjun, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, and Jinsong Su

Dochplt: A massively multilingual document-level translation dataset.CoRR, abs/2508.13079. [Shen et al.2024] Shen, Huangjun, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, and Jinsong Su

work page arXiv 2024
[15]

[Sun et al.2025a] Sun, Ting, Cheng Cui, Yuning Du, and Yi Liu

A survey on multi-modal machine translation: Tasks, methods and challenges. [Sun et al.2025a] Sun, Ting, Cheng Cui, Yuning Du, and Yi Liu. 2025a. PP-DocLayout: A unified docu- ment layout detection model to accelerate large-scale data construction. March. [Sun et al.2025b] Sun, Yirong, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, and Xiaoyu Shen. 202...

work page 2025
[16]

TaBERT: Pre- training for joint understanding of textual and tabular data. In Jurafsky, Dan, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computa- tional Linguistics, pages 8413–8426, Online, July. Association for Computational Linguistics. [Yin et al.2024] Yin, Shukang, Chaoyou Fu...

work page 2024
[17]

[Zhang et al.2025a] Zhang, Le, Qian Yang, and Aish- warya Agrawal. 2025a. Assessing and learning alignment of unimodal vision and language mod- els. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 14604–14614. Com- puter Vision Foundation / IEEE. [Zhang et al.2025b] Zhang, Yaping, Yu...

work page 2025
[18]

[Zuo et al.2025] Zuo, Fei, Kehai Chen, Yu Zhang, Zhengshan Xue, and Min Zhang

Doclayout-yolo: En- hancing document layout analysis through diverse synthetic data and global-to-local adaptive percep- tion. [Zuo et al.2025] Zuo, Fei, Kehai Chen, Yu Zhang, Zhengshan Xue, and Min Zhang

work page 2025
[19]

Tylko tyłem

InImage- Trans: Multimodal LLM-based text image machine translation. In Che, Wanxiang, Joyce Nabende, Eka- terina Shutova, and Mohammad Taher Pilehvar, ed- itors,Findings of the Association for Computational Linguistics: ACL 2025, pages 20256–20277, Vienna, Austria, July. Association for Computational Lin- guistics. Appendix A. Vector Indices Mapping Tabl...

work page 2025
