PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

David Semedo; Diogo Gl\'oria-Silva; Diogo Tavares; Jo\~ao Cardeira; Jo\~ao Magalh\~aes; Manuel Letras da Luz; Rafael Ferreira

arxiv: 2606.19096 · v2 · pith:Z5KJ4MF6new · submitted 2026-06-17 · 💻 cs.CV

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

Jo\~ao Cardeira , Diogo Gl\'oria-Silva , Manuel Letras da Luz , Rafael Ferreira , Diogo Tavares , David Semedo , Jo\~ao Magalh\~aes This is my paper

Pith reviewed 2026-07-02 22:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords PorTEXTOEuropean PortugueseOCR benchmarkvisual text extractionmultilingual datapt-PTsynthetic vs real

0 comments

The pith

PorTEXTO shows specialized multilingual data improves European Portuguese OCR more than larger models or higher resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

European Portuguese has lacked benchmarks for modern visual text extraction, with existing ones limited to historical materials. PorTEXTO supplies the first dataset of contemporary, culturally relevant samples and evaluates models on them. The work finds a clear performance decline when models move from synthetic to real-world images. It establishes that specialized multilingual training data produces stronger pt-PT results than scaling model size or image resolution. This finding points toward data-focused strategies for advancing OCR in languages that have received less attention.

Core claim

PorTEXTO is presented as the first benchmark for contemporary European Portuguese visual text extraction. An annotation process that starts with frontier large vision-language model transcriptions and adds exhaustive native-speaker review supplies the ground truth. Evaluation across models reveals a sharp drop from synthetic to real samples and identifies specialized multilingual data as the stronger driver of pt-PT performance compared with model size or resolution budget, which motivates the release of open pt-PT OCR resources.

What carries the argument

The PorTEXTO benchmark dataset, which supplies paired real and synthetic pt-PT images for measuring visual text extraction accuracy.

If this is right

Specialized multilingual training data yields higher accuracy on pt-PT text than equivalent gains in model size.
Resolution increases alone do not close the performance gap observed on real pt-PT samples.
Synthetic-only training hides limitations that appear once models encounter authentic images.
Open release of pt-PT OCR resources follows directly from the observed data advantage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-over-scale pattern may appear in benchmarks for other underrepresented languages if similar evaluation splits are used.
The LVLM-plus-native-review annotation method could be reused to create comparable benchmarks without requiring large new annotation teams.
Document-processing applications in Portuguese-speaking regions stand to gain from models tuned on the released resources.

Load-bearing premise

The annotation pipeline that combines frontier LVLM transcriptions with exhaustive native-speaker review produces ground truth accurate enough for reliable model comparisons.

What would settle it

A model trained without specialized multilingual data but with substantially larger size or resolution achieving higher accuracy on PorTEXTO real-world samples than models that use such data would challenge the central claim.

Figures

Figures reproduced from arXiv: 2606.19096 by David Semedo, Diogo Gl\'oria-Silva, Diogo Tavares, Jo\~ao Cardeira, Jo\~ao Magalh\~aes, Manuel Letras da Luz, Rafael Ferreira.

**Figure 1.** Figure 1: Modern OCR applications. Contemporary OCR applications range from professional settings, such as automatic document processing, to casual and educational purposes, often involving user dialogue with an AI chat. rendering pipeline that composites pt-PT text over varied backgrounds. We describe our annotation pipeline, where image text transcripts are annotated using a frontier LVLM and exhaustively reviewe… view at source ↗

**Figure 2.** Figure 2: Handwritten Subsets. Representative samples from Full Page and Region Handwritten subsets of PorTEXTO. The word redacted is used purely for illustration purposes, signaling regions where human reviewers redacted poor text. These samples expose models to authentic handwriting, idiosyncratic ptPT abbreviations, mixed prose, and everyday orthography. Arabic-script language. Its Portuguese subset, however, is… view at source ↗

**Figure 3.** Figure 3: In-The-Wild Subset. Representative samples from In-The-Wild subset of PorTEXTO. These samples expose models to important real world and culturally meaningful concepts. It includes a wide variety of domains, such as school exams, common invoices, government forms, and also a variety of samples of casual nature, such as photographs of coffee shops and town signs, popular sports clubs, TV shows and memes, an… view at source ↗

**Figure 4.** Figure 4: Synthetic Subset. Representative samples from the Synthetic subset of PorTEXTO. These samples expose models to pure pt-PT text in common document layouts, rendered on top of general purpose images. Represents data more likely to be aligned with training data distribution of most models. models to authentic handwriting, idiosyncratic abbreviations, mixed prose, and the everyday orthography of contemporary p… view at source ↗

**Figure 5.** Figure 5: The platform used for annotation. Annotators can visualize the image (1); redact the image by drawing a square using the mouse pointer, and can reverse redactions (2); verify and correct transcriptions (3); validate or invalidate samples (4); and iterate to the next or the previous sample, as well as visualize progress (5). 3.2 Curation of Reference Transcriptions To enforce data quality, collected data wa… view at source ↗

**Figure 5.** Figure 5: Distributions of transcription length and source image resolution. The wide spread along both dimensions reflects the diversity of textual content and visual conditions covered by the benchmark. between annotators, increasing the likelihood that each sample was seen by a different reviewer from the first round, adding further guarantee that out-of-scope and poor samples are discarded. 3.3 Curated Dataset … view at source ↗

**Figure 6.** Figure 6: Distributions of transcription length and source image resolution. The wide spread along both dimensions reflects the diversity of textual content and visual conditions covered by the benchmark. sub-regions of text pertaining to unique subtopics. These sub-regions were obtained by prompting the same LVLM to return, for each page, an ordered list of bounding boxes corresponding to its main logical content … view at source ↗

**Figure 6.** Figure 6: OCRBench vs. PorTEXTO mean ANLS; both axes on a 0–100 scale. The dashed line is the OLS fit (R2 = 0.05). Gemma-4-E4B-it Qwen3.6-27B Qwen3.5-9B Gemma-4-31B-it HW (R) HW (FP) SYN ITW Mean +3.5 +10.2 +4.2 +1.8 +17.5 -1.4 -3.2 +6.2 +0.3 §0.0 -0.5 -0.1 +8.1 +4.0 +5.8 -1.5 +7.3 +3.2 +1.6 +1.6 [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: OCRBench vs. PorTEXTO mean ANLS; both axes on a 0–100 scale. The dashed line is the OLS fit (R2 = 0.05). Gemma-4-E4B-it Qwen3.6-27B Qwen3.5-9B Gemma-4-31B-it HW (R) HW (FP) SYN ITW Mean +3.5 +10.2 +4.2 +1.8 +17.5 -1.4 -3.2 +6.2 +0.3 §0.0 -0.5 -0.1 +8.1 +4.0 +5.8 -1.5 +7.3 +3.2 +1.6 +1.6 [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

read the original abstract

European Portuguese (pt-PT) is largely absent from OCR benchmarks, which skew toward high-resource languages. The few benchmarks that cover pt-PT focus on historical artifacts and literature. This work addresses modern OCR applications, introducing PorTEXTO, the first benchmark for contemporary and culturally relevant pt-PT visual text extraction. To ascertain quality, we employ an annotation pipeline combining transcriptions from a frontier LVLM with exhaustive review by native speakers. We observe a sharp performance drop from synthetic to real world samples in most models, and find that, currently, specialized multilingual data is a better driver for pt-PT performance than model size or resolution budget, motivating the release of open pt-PT OCR resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PorTEXTO gives the field a needed modern benchmark for pt-PT OCR, but the lack of annotation validation metrics weakens the support for its performance claims.

read the letter

PorTEXTO gives the field a needed modern benchmark for pt-PT OCR, but the lack of annotation validation metrics weakens the support for its performance claims.

The work introduces PorTEXTO as the first benchmark targeting contemporary and culturally relevant European Portuguese visual text. Prior resources either ignore pt-PT or focus on historical documents and literature. The authors generate initial transcriptions with a frontier large vision-language model and then have native speakers do exhaustive review. They evaluate models on both synthetic and real samples, noting a sharp drop in performance on the real ones. They also compare factors and conclude that specialized multilingual data drives better pt-PT results than simply using larger models or higher resolution.

The paper does a good job of spotting the language gap and producing a new resource that can be used for evaluation. Releasing it openly is helpful for the community working on OCR in less-covered languages.

The main concern is around the ground truth quality. The pipeline sounds reasonable on paper, but without any reported metrics like inter-annotator agreement rates, percentage of corrections made by reviewers, or even a small validation set checked independently, it's difficult to gauge how reliable the labels are. If the LVLM introduces consistent biases that the review doesn't fully catch, that could affect the measured gaps between synthetic and real data as well as the relative importance of data specialization versus model scale. The stress test note on this point seems on target given what's described.

This kind of paper is mainly for practitioners and researchers in computer vision who need benchmarks for specific languages or who are studying multilingual model behavior. Someone building or fine-tuning OCR systems for Portuguese would get direct value from the dataset and the reported baselines.

Overall, the paper shows clear thinking in addressing an underserved area. It should go to peer review so that the community can assess the data and suggest improvements to the validation steps.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PorTEXTO, the first benchmark for contemporary European Portuguese (pt-PT) visual text extraction in modern OCR applications. It employs an annotation pipeline of frontier LVLM transcriptions followed by exhaustive native-speaker review, reports a sharp performance drop from synthetic to real-world samples across models, and concludes that specialized multilingual data currently drives pt-PT performance more effectively than model size or resolution budget, motivating the release of open pt-PT OCR resources.

Significance. If the ground-truth quality can be quantitatively validated, the work would usefully demonstrate the value of language-specific data over scaling for low-resource languages in vision-language tasks and provide a needed modern benchmark to complement existing historical pt-PT resources. The explicit release of open resources is a concrete positive contribution.

major comments (2)

[Annotation Pipeline] Annotation Pipeline section: the pipeline is asserted to produce high-quality labels via LVLM transcription plus native-speaker review, yet no quantitative validation is reported (inter-annotator agreement, sampled error rates, or disagreement-resolution protocol). Because the central claims rest on measured gaps between synthetic and real samples and on the relative importance of data versus scale, the absence of these checks is load-bearing.
[Results and Evaluation] Results and Evaluation sections: the abstract states a performance drop and the superiority of specialized multilingual data, but the manuscript provides neither dataset statistics (e.g., number of real vs. synthetic samples, character distributions) nor detailed ablation tables separating data specialization from model size and resolution. Without these, the comparative claim cannot be verified.

minor comments (1)

[Figures] Figure captions and axis labels could more explicitly distinguish synthetic from real-world test sets to aid quick interpretation of the reported drop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional quantitative support would strengthen the manuscript. We address each point below and commit to revisions that incorporate the requested elements.

read point-by-point responses

Referee: [Annotation Pipeline] Annotation Pipeline section: the pipeline is asserted to produce high-quality labels via LVLM transcription plus native-speaker review, yet no quantitative validation is reported (inter-annotator agreement, sampled error rates, or disagreement-resolution protocol). Because the central claims rest on measured gaps between synthetic and real samples and on the relative importance of data versus scale, the absence of these checks is load-bearing.

Authors: We agree that the absence of quantitative validation metrics is a limitation. Although the pipeline combines frontier LVLM transcriptions with exhaustive native-speaker review, we will add inter-annotator agreement scores on a sampled subset of annotations, sampled error rates, and a description of the disagreement-resolution protocol to the revised Annotation Pipeline section. revision: yes
Referee: [Results and Evaluation] Results and Evaluation sections: the abstract states a performance drop and the superiority of specialized multilingual data, but the manuscript provides neither dataset statistics (e.g., number of real vs. synthetic samples, character distributions) nor detailed ablation tables separating data specialization from model size and resolution. Without these, the comparative claim cannot be verified.

Authors: We acknowledge that explicit dataset statistics and detailed ablations are required for full verifiability of the claims. The manuscript contains comparative results, but we will add a dedicated table reporting the number of real versus synthetic samples, character distributions, and expanded ablation tables that isolate the effects of specialized multilingual data from model size and resolution in the revised Results and Evaluation sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark creation and evaluation are self-contained

full rationale

The paper introduces PorTEXTO as a new benchmark for pt-PT visual text extraction, describes an annotation pipeline (LVLM transcription plus native-speaker review), and reports empirical performance comparisons across models on synthetic vs. real samples. The central claim—that specialized multilingual data outperforms model size or resolution—is presented as an observation from these measurements rather than a derivation. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, uniqueness theorems, or ansatzes appear in the provided abstract or described structure. The evaluation chain consists of direct metric computation on newly created data and does not reduce to its inputs by construction. Absence of inter-annotator metrics is a potential evidence-quality concern but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no free parameters, invented entities, or additional axioms are described beyond the annotation pipeline assumption.

axioms (1)

domain assumption The combination of LVLM-generated transcriptions and native speaker review yields accurate annotations
This pipeline is presented as the basis for benchmark quality.

pith-pipeline@v0.9.1-grok · 5671 in / 1055 out tokens · 34043 ms · 2026-07-02T22:15:55.137376+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 10 canonical work pages · 5 internal anchors

[1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

An, X., et al.: LLaVA-OneVision-1.5: Fully open framework for democratized mul- timodal training. arXiv:2509.23661 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

In: ICCV (2019)

Biten, A.F., et al.: Scene text visual question answering. In: ICCV (2019)

2019
[3]

Advances in Neural Information Processing Systems (2026)

Cho, J.H., et al.: Perceptionlm: Open-access data and models for detailed visual understanding. Advances in Neural Information Processing Systems (2026)

2026
[4]

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Clark, C., et al.: Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv:2601.10611 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

arXiv:2511.03929 (2025)

Deshmukh, A.S., et al.: Nvidia nemotron nano v2 vl. arXiv:2511.03929 (2025)

work page arXiv 2025
[6]

https://deepmind.google/models/ model-cards/gemini-3-1-pro/ (2026)

Google DeepMind: Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/ (2026)

2026
[7]

https://ai.google.dev/gemma/docs/ core/model_card_4 (2026)

Google DeepMind: Gemma 4 model card. https://ai.google.dev/gemma/docs/ core/model_card_4 (2026)

2026
[8]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Hong, W., et al.: Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv:2507.01006 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

arXiv:2601.09668 (2026)

Huang, A., et al.: Step3-vl-10b technical report. arXiv:2601.09668 (2026)

work page arXiv 2026
[10]

In: ECCV (2022)

Kim, G., et al.: OCR-free document understanding transformer. In: ECCV (2022)

2022
[11]

In: MT (2005), https://www.statmt.org/europarl/

Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT (2005), https://www.statmt.org/europarl/

2005
[12]

IJCV (2020)

Kuznetsova, A., et al.: The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV (2020)

2020
[13]

Ministral 3

Liu, A.H., et al.: Ministral 3. arXiv:2601.08584 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

NVIDIA, https://huggingface.co/blog/nvidia/nemotron-ocr-v2 (2026)

Liu, B., et al.: Building a fast multilingual OCR model with synthetic data. NVIDIA, https://huggingface.co/blog/nvidia/nemotron-ocr-v2 (2026)

2026
[15]

Science China Information Sciences (2024)

Liu, Y., et al.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences (2024)

2024
[16]

In: CVPR (2019)

Marino, K., et al.: OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR (2019)

2019
[17]

In: WACV (2021)

Mathew, M., et al.: DocVQA: A dataset for VQA on document images. In: WACV (2021)

2021
[18]

In: WACV (2022) 12 J

Mathew, M., et al.: InfographicVQA. In: WACV (2022) 12 J. Cardeira et al

2022
[19]

Journal of Imaging (2025)

Matos, A., et al.: iForal: Automated handwritten text transcription for historical medieval manuscripts. Journal of Imaging (2025)

2025
[20]

Hugging Face, https://huggingface.co/ datasets/mazafard/portuguese-ocr-dataset (2025)

mazafard: Portuguese OCR dataset. Hugging Face, https://huggingface.co/ datasets/mazafard/portuguese-ocr-dataset (2025)

2025
[21]

In: ICDAR (2024)

Neto, A.F.S., et al.: BRESSAY: A Brazilian Portuguese dataset for offline hand- written text recognition. In: ICDAR (2024)

2024
[22]

In: SIGIR (2021)

de Oliveira, L.L., et al.: REGIS: A test collection for geoscientific documents in Portuguese. In: SIGIR (2021)

2021
[23]

In: CIKM (2025)

Osório, T.F., et al.: Portuguese post-OCR resources for text optimisation. In: CIKM (2025)

2025
[24]

In: CVPR (2025)

Ouyang, L., et al.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In: CVPR (2025)

2025
[25]

In: ACL (2002)

Papineni, K., et al.: BLEU: a method for automatic evaluation of machine trans- lation. In: ACL (2002)

2002
[26]

INESC TEC, https://episa

Project, E.: Typewritten digital representations of Portuguese cultural heritage documents from the 20th century (EPISA dataset). INESC TEC, https://episa. inesctec.pt/outcomes/ (2022)

2022
[27]

https://qwen.ai/blog? id=qwen3.5 (2026)

Qwen Team: Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog? id=qwen3.5 (2026)

2026
[28]

https: //qwen.ai/blog?id=qwen3.6-27b (2026)

Qwen Team: Qwen3.6-27B: Flagship-level coding in a 27B dense model. https: //qwen.ai/blog?id=qwen3.6-27b (2026)

2026
[29]

In: ICDAR (2023)

Santos, M.K., et al.: ESTER-Pt: An evaluation suite for TExt recognition in Por- tuguese. In: ICDAR (2023)

2023
[30]

In: CVPR (2019)

Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)

2019
[31]

arXiv:2406.13469 (2024)

Smart, D.S., et al.: Encoder vs decoder: Comparative analysis of encoder and decoder language models on multilingual nlu tasks. arXiv:2406.13469 (2024)

work page arXiv 2024
[32]

In: ACL Findings (2025)

Tang, J., et al.: MTVQA: Benchmarking multilingual text-centric visual question answering. In: ACL Findings (2025)

2025
[33]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., et al.: Internvl3. 5: Advancing open-source multimodal models in ver- satility, reasoning, and efficiency. arXiv:2508.18265 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

arXiv preprint arXiv:2601.20552 (2026)

Wei, H., et al.: Deepseek-ocr 2: Visual causal flow. arXiv:2601.20552 (2026)

work page arXiv 2026
[35]

HuggingFace Datasets (2023), https://huggingface.co/datasets/wikimedia/wikipedia

Wikimedia Foundation: Wikimedia wikipedia (20231101.pt). HuggingFace Datasets (2023), https://huggingface.co/datasets/wikimedia/wikipedia

2023
[36]

In: ICCV (2025)

Yang, Z., et al.: CC-OCR: A comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy. In: ICCV (2025)

2025
[37]

In: NAACL (2025)

Zhang, K., et al.: Lmms-eval: Reality check on the evaluation of large multimodal models. In: NAACL (2025)

2025
[38]

arXiv:2510.13795 (2025)

Zhang, Y., et al.: Bee: A high-quality corpus and full-stack suite to unlock advanced fully open MLLMs. arXiv:2510.13795 (2025)

work page arXiv 2025

[1] [1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

An, X., et al.: LLaVA-OneVision-1.5: Fully open framework for democratized mul- timodal training. arXiv:2509.23661 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

In: ICCV (2019)

Biten, A.F., et al.: Scene text visual question answering. In: ICCV (2019)

2019

[3] [3]

Advances in Neural Information Processing Systems (2026)

Cho, J.H., et al.: Perceptionlm: Open-access data and models for detailed visual understanding. Advances in Neural Information Processing Systems (2026)

2026

[4] [4]

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Clark, C., et al.: Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv:2601.10611 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

arXiv:2511.03929 (2025)

Deshmukh, A.S., et al.: Nvidia nemotron nano v2 vl. arXiv:2511.03929 (2025)

work page arXiv 2025

[6] [6]

https://deepmind.google/models/ model-cards/gemini-3-1-pro/ (2026)

Google DeepMind: Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/ (2026)

2026

[7] [7]

https://ai.google.dev/gemma/docs/ core/model_card_4 (2026)

Google DeepMind: Gemma 4 model card. https://ai.google.dev/gemma/docs/ core/model_card_4 (2026)

2026

[8] [8]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Hong, W., et al.: Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv:2507.01006 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

arXiv:2601.09668 (2026)

Huang, A., et al.: Step3-vl-10b technical report. arXiv:2601.09668 (2026)

work page arXiv 2026

[10] [10]

In: ECCV (2022)

Kim, G., et al.: OCR-free document understanding transformer. In: ECCV (2022)

2022

[11] [11]

In: MT (2005), https://www.statmt.org/europarl/

Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT (2005), https://www.statmt.org/europarl/

2005

[12] [12]

IJCV (2020)

Kuznetsova, A., et al.: The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV (2020)

2020

[13] [13]

Ministral 3

Liu, A.H., et al.: Ministral 3. arXiv:2601.08584 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] [14]

NVIDIA, https://huggingface.co/blog/nvidia/nemotron-ocr-v2 (2026)

Liu, B., et al.: Building a fast multilingual OCR model with synthetic data. NVIDIA, https://huggingface.co/blog/nvidia/nemotron-ocr-v2 (2026)

2026

[15] [15]

Science China Information Sciences (2024)

Liu, Y., et al.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences (2024)

2024

[16] [16]

In: CVPR (2019)

Marino, K., et al.: OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR (2019)

2019

[17] [17]

In: WACV (2021)

Mathew, M., et al.: DocVQA: A dataset for VQA on document images. In: WACV (2021)

2021

[18] [18]

In: WACV (2022) 12 J

Mathew, M., et al.: InfographicVQA. In: WACV (2022) 12 J. Cardeira et al

2022

[19] [19]

Journal of Imaging (2025)

Matos, A., et al.: iForal: Automated handwritten text transcription for historical medieval manuscripts. Journal of Imaging (2025)

2025

[20] [20]

Hugging Face, https://huggingface.co/ datasets/mazafard/portuguese-ocr-dataset (2025)

mazafard: Portuguese OCR dataset. Hugging Face, https://huggingface.co/ datasets/mazafard/portuguese-ocr-dataset (2025)

2025

[21] [21]

In: ICDAR (2024)

Neto, A.F.S., et al.: BRESSAY: A Brazilian Portuguese dataset for offline hand- written text recognition. In: ICDAR (2024)

2024

[22] [22]

In: SIGIR (2021)

de Oliveira, L.L., et al.: REGIS: A test collection for geoscientific documents in Portuguese. In: SIGIR (2021)

2021

[23] [23]

In: CIKM (2025)

Osório, T.F., et al.: Portuguese post-OCR resources for text optimisation. In: CIKM (2025)

2025

[24] [24]

In: CVPR (2025)

Ouyang, L., et al.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In: CVPR (2025)

2025

[25] [25]

In: ACL (2002)

Papineni, K., et al.: BLEU: a method for automatic evaluation of machine trans- lation. In: ACL (2002)

2002

[26] [26]

INESC TEC, https://episa

Project, E.: Typewritten digital representations of Portuguese cultural heritage documents from the 20th century (EPISA dataset). INESC TEC, https://episa. inesctec.pt/outcomes/ (2022)

2022

[27] [27]

https://qwen.ai/blog? id=qwen3.5 (2026)

Qwen Team: Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog? id=qwen3.5 (2026)

2026

[28] [28]

https: //qwen.ai/blog?id=qwen3.6-27b (2026)

Qwen Team: Qwen3.6-27B: Flagship-level coding in a 27B dense model. https: //qwen.ai/blog?id=qwen3.6-27b (2026)

2026

[29] [29]

In: ICDAR (2023)

Santos, M.K., et al.: ESTER-Pt: An evaluation suite for TExt recognition in Por- tuguese. In: ICDAR (2023)

2023

[30] [30]

In: CVPR (2019)

Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)

2019

[31] [31]

arXiv:2406.13469 (2024)

Smart, D.S., et al.: Encoder vs decoder: Comparative analysis of encoder and decoder language models on multilingual nlu tasks. arXiv:2406.13469 (2024)

work page arXiv 2024

[32] [32]

In: ACL Findings (2025)

Tang, J., et al.: MTVQA: Benchmarking multilingual text-centric visual question answering. In: ACL Findings (2025)

2025

[33] [33]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., et al.: Internvl3. 5: Advancing open-source multimodal models in ver- satility, reasoning, and efficiency. arXiv:2508.18265 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

arXiv preprint arXiv:2601.20552 (2026)

Wei, H., et al.: Deepseek-ocr 2: Visual causal flow. arXiv:2601.20552 (2026)

work page arXiv 2026

[35] [35]

HuggingFace Datasets (2023), https://huggingface.co/datasets/wikimedia/wikipedia

Wikimedia Foundation: Wikimedia wikipedia (20231101.pt). HuggingFace Datasets (2023), https://huggingface.co/datasets/wikimedia/wikipedia

2023

[36] [36]

In: ICCV (2025)

Yang, Z., et al.: CC-OCR: A comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy. In: ICCV (2025)

2025

[37] [37]

In: NAACL (2025)

Zhang, K., et al.: Lmms-eval: Reality check on the evaluation of large multimodal models. In: NAACL (2025)

2025

[38] [38]

arXiv:2510.13795 (2025)

Zhang, Y., et al.: Bee: A high-quality corpus and full-stack suite to unlock advanced fully open MLLMs. arXiv:2510.13795 (2025)

work page arXiv 2025